

<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />

  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  
  <title>Health Checks &mdash; Ceph Documentation</title>
  

  
  <link rel="stylesheet" href="../../../_static/ceph.css" type="text/css" />
  <link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
  <link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
  <link rel="stylesheet" href="../../../_static/ceph.css" type="text/css" />
  <link rel="stylesheet" href="../../../_static/graphviz.css" type="text/css" />
  <link rel="stylesheet" href="../../../_static/css/custom.css" type="text/css" />

  
  

  
  

  

  
  <!--[if lt IE 9]>
    <script src="../../../_static/js/html5shiv.min.js"></script>
  <![endif]-->
  
    
      <script type="text/javascript" id="documentation_options" data-url_root="../../../" src="../../../_static/documentation_options.js"></script>
        <script src="../../../_static/jquery.js"></script>
        <script src="../../../_static/_sphinx_javascript_frameworks_compat.js"></script>
        <script src="../../../_static/doctools.js"></script>
        <script src="../../../_static/sphinx_highlight.js"></script>
    
    <script type="text/javascript" src="../../../_static/js/theme.js"></script>

    
    <link rel="index" title="Index" href="../../../genindex/" />
    <link rel="search" title="Search" href="../../../search/" />
    <link rel="next" title="监控集群" href="../monitoring/" />
    <link rel="prev" title="操纵集群" href="../operating/" /> 
</head>

<body class="wy-body-for-nav">

   
  <header class="top-bar">
    <div role="navigation" aria-label="Page navigation">
  <ul class="wy-breadcrumbs">
      <li><a href="../../../" class="icon icon-home" aria-label="Home"></a></li>
          <li class="breadcrumb-item"><a href="../../">Ceph 存储集群</a></li>
          <li class="breadcrumb-item"><a href="../">集群运维</a></li>
      <li class="breadcrumb-item active">健康检查</li>
      <li class="wy-breadcrumbs-aside">
            <a href="../../../_sources/rados/operations/health-checks.rst.txt" rel="nofollow"> View page source</a>
      </li>
  </ul>
  <hr/>
</div>
  </header>
  <div class="wy-grid-for-nav">
    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search"  style="background: #eee" >
          

          
            <a href="../../../" class="icon icon-home"> Ceph
          

          
          </a>

          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../../search/" method="get">
    <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        
        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
              
            
            
              <ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../../../start/">Ceph 简介</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../install/">安装 Ceph</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../cephadm/">Cephadm</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../../">Ceph 存储集群</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../../configuration/">配置</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="../">运维</a><ul class="current">
<li class="toctree-l3"><a class="reference internal" href="../operating/">操纵集群</a></li>
<li class="toctree-l3 current"><a class="current reference internal" href="#">健康检查</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#id2">概览</a></li>
<li class="toctree-l4"><a class="reference internal" href="#id3">状态定义</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="../monitoring/">监控集群</a></li>
<li class="toctree-l3"><a class="reference internal" href="../monitoring-osd-pg/">监控 OSD 和归置组</a></li>
<li class="toctree-l3"><a class="reference internal" href="../user-management/">用户管理</a></li>
<li class="toctree-l3"><a class="reference internal" href="../pgcalc/">PG Calc</a></li>
<li class="toctree-l3"><a class="reference internal" href="../data-placement/">数据归置概览</a></li>
<li class="toctree-l3"><a class="reference internal" href="../pools/">存储池</a></li>
<li class="toctree-l3"><a class="reference internal" href="../erasure-code/">纠删码</a></li>
<li class="toctree-l3"><a class="reference internal" href="../cache-tiering/">分级缓存</a></li>
<li class="toctree-l3"><a class="reference internal" href="../placement-groups/">归置组</a></li>
<li class="toctree-l3"><a class="reference internal" href="../upmap/">使用 pg-upmap</a></li>
<li class="toctree-l3"><a class="reference internal" href="../read-balancer/">Operating the Read (Primary) Balancer</a></li>
<li class="toctree-l3"><a class="reference internal" href="../balancer/">均衡器模块</a></li>
<li class="toctree-l3"><a class="reference internal" href="../crush-map/">CRUSH 图</a></li>
<li class="toctree-l3"><a class="reference internal" href="../crush-map-edits/">手动编辑一个 CRUSH 图</a></li>
<li class="toctree-l3"><a class="reference internal" href="../stretch-mode/">Stretch Clusters</a></li>
<li class="toctree-l3"><a class="reference internal" href="../change-mon-elections/">Configuring Monitor Election Strategies</a></li>
<li class="toctree-l3"><a class="reference internal" href="../add-or-rm-osds/">增加/删除 OSD</a></li>
<li class="toctree-l3"><a class="reference internal" href="../add-or-rm-mons/">增加/删除监视器</a></li>
<li class="toctree-l3"><a class="reference internal" href="../devices/">设备管理</a></li>
<li class="toctree-l3"><a class="reference internal" href="../bluestore-migration/">迁移到 BlueStore</a></li>
<li class="toctree-l3"><a class="reference internal" href="../control/">命令参考</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/community/">Ceph 社区</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/troubleshooting-mon/">监视器故障排除</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/troubleshooting-osd/">OSD 故障排除</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/troubleshooting-pg/">归置组排障</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/log-and-debug/">日志记录和调试</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/cpu-profiling/">CPU 剖析</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../troubleshooting/memory-profiling/">内存剖析</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../../man/">    手册页</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../troubleshooting/">故障排除</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../api/">APIs</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../../../cephfs/">Ceph 文件系统</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../rbd/">Ceph 块设备</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../radosgw/">Ceph 对象网关</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../mgr/">Ceph 管理器守护进程</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../mgr/dashboard/">Ceph 仪表盘</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../monitoring/">监控概览</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../api/">API 文档</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../architecture/">体系结构</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../dev/developer_guide/">开发者指南</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../dev/internals/">Ceph 内幕</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../governance/">项目管理</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../foundation/">Ceph 基金会</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../ceph-volume/">ceph-volume</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../releases/general/">Ceph 版本（总目录）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../releases/">Ceph 版本（索引）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../security/">Security</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../hardware-monitoring/">硬件监控</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../glossary/">Ceph 术语</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../jaegertracing/">Tracing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../translation_cn/">中文版翻译资源</a></li>
</ul>

            
          
        </div>
        
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" aria-label="top navigation">
        
          <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
          <a href="../../../">Ceph</a>
        
      </nav>


      <div class="wy-nav-content">
        
        <div class="rst-content">
        
          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
<div id="dev-warning" class="admonition note">
  <p class="first admonition-title">Notice</p>
  <p class="last">This document is for a development version of Ceph.</p>
</div>
  <div id="docubetter" align="right" style="padding: 5px; font-weight: bold;">
    <a href="https://pad.ceph.com/p/Report_Documentation_Bugs">Report a Documentation Bug</a>
  </div>

  
  <section id="health-checks">
<span id="id1"></span><h1>健康检查<a class="headerlink" href="#health-checks" title="Permalink to this heading"></a></h1>
<section id="id2">
<h2>Overview<a class="headerlink" href="#id2" title="Permalink to this heading"></a></h2>
<p>There is a finite set of health messages that a Ceph cluster can raise. These
messages are known as <em>health checks</em>, and each health check has a unique
identifier.</p>
<p>The identifier is a terse human-readable string -- that is, the identifier is
readable in much the same way as a typical variable name. It is intended to
enable tools (for example, monitoring and UIs) to make sense of health checks and present them
in a way that reflects their meaning.</p>
<p>This page lists the health checks that are raised by the monitor and manager
daemons. In addition to these, you may see health checks that originate
from CephFS MDS daemons (see <a class="reference internal" href="../../../cephfs/health-messages/#cephfs-health-messages"><span class="std std-ref">CephFS 健康消息</span></a>), and health checks
that are defined by <code class="docutils literal notranslate"><span class="pre">ceph-mgr</span></code> modules.</p>
</section>
<section id="id3">
<h2>Definitions<a class="headerlink" href="#id3" title="Permalink to this heading"></a></h2>
<section id="id4">
<h3>Monitors<a class="headerlink" href="#id4" title="Permalink to this heading"></a></h3>
<section id="daemon-old-version">
<h4>DAEMON_OLD_VERSION<a class="headerlink" href="#daemon-old-version" title="Permalink to this heading"></a></h4>
<p>One or more Ceph daemons are running an old Ceph release.  A health check is
raised if multiple versions are detected.  This condition must exist for a
period of time greater than <code class="docutils literal notranslate"><span class="pre">mon_warn_older_version_delay</span></code> (set to one week
by default) in order for the health check to be raised. This allows most
upgrades to proceed without raising a warning that is both expected and
ephemeral. If the upgrade is paused for an extended time, <code class="docutils literal notranslate"><span class="pre">health</span> <span class="pre">mute</span></code> can
be used by running <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">mute</span> <span class="pre">DAEMON_OLD_VERSION</span> <span class="pre">--sticky</span></code>. Be sure,
however, to run <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">unmute</span> <span class="pre">DAEMON_OLD_VERSION</span></code> after the upgrade has
finished so that any future, unexpected instances are not masked.</p>
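<p>For example, to silence this check for the duration of a prolonged upgrade and then re-enable it once the upgrade has finished, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph health mute DAEMON_OLD_VERSION --sticky</span>
<span class="prompt1">ceph health unmute DAEMON_OLD_VERSION</span>
</pre></div></div>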
</section>
<section id="mon-down">
<h4>MON_DOWN<a class="headerlink" href="#mon-down" title="Permalink to this heading"></a></h4>
<p>One or more Ceph Monitor daemons are down. The cluster requires a majority
(more than one-half) of the provisioned monitors to be available. When one or
more monitors are down, clients may have a harder time forming their initial
connection to the cluster, as they may need to try additional IP addresses
before they reach an operating monitor.</p>
<p>Down monitor daemons should be restored or restarted as soon as possible to
reduce the risk that an additional monitor failure may cause a service outage.</p>
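<p>As a sketch, assuming a traditional package-based deployment in which each monitor runs under a systemd unit named <code class="docutils literal notranslate"><span class="pre">ceph-mon@&lt;hostname&gt;</span></code> (cephadm and container-based deployments use different unit names), you might identify and restart a down monitor as follows:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph health detail                       # lists which monitor(s) are down</span>
<span class="prompt1">systemctl restart ceph-mon@&lt;hostname&gt;    # run on the affected host</span>
</pre></div></div>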
</section>
<section id="mon-clock-skew">
<h4>MON_CLOCK_SKEW<a class="headerlink" href="#mon-clock-skew" title="Permalink to this heading"></a></h4>
<p>The clocks on hosts running Ceph Monitor daemons are not well-synchronized.
This health check is raised if the cluster detects a clock skew greater than
<code class="docutils literal notranslate"><span class="pre">mon_clock_drift_allowed</span></code>.</p>
<p>This issue is best resolved by synchronizing the clocks by using a tool like
the legacy <code class="docutils literal notranslate"><span class="pre">ntpd</span></code> or the newer <code class="docutils literal notranslate"><span class="pre">chrony</span></code>.  It is ideal to configure NTP
daemons to sync against multiple internal and external sources for resilience;
the protocol will adaptively determine the best available source.  It is also
beneficial to have the NTP daemons on Ceph Monitor hosts sync against each
other, as it is even more important that Monitors be synchronized with each
other than it is for them to be <em>correct</em> with respect to reference time.</p>
<p>If it is impractical to keep the clocks closely synchronized, the
<code class="docutils literal notranslate"><span class="pre">mon_clock_drift_allowed</span></code> threshold can be increased. However, this value
must stay significantly below the <code class="docutils literal notranslate"><span class="pre">mon_lease</span></code> interval in order for the
monitor cluster to function properly.  It is not difficult with a quality NTP
or PTP configuration to have sub-millisecond synchronization, so there are
very, very few occasions when it is appropriate to change this value.</p>
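<p>To report the clock skew currently observed by the monitors, and (only if truly necessary) to raise the allowed drift, the following commands can be used. The <code class="docutils literal notranslate"><span class="pre">0.1</span></code> value (in seconds) below is only an illustration, not a recommendation:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph time-sync-status</span>
<span class="prompt1">ceph config set mon mon_clock_drift_allowed 0.1</span>
</pre></div></div>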
</section>
<section id="mon-msgr2-not-enabled">
<h4>MON_MSGR2_NOT_ENABLED<a class="headerlink" href="#mon-msgr2-not-enabled" title="Permalink to this heading"></a></h4>
<p>The <code class="xref std std-confval docutils literal notranslate"><span class="pre">ms_bind_msgr2</span></code> option is enabled but one or more monitors are not
configured in the cluster’s monmap to bind to a v2 port. This means that
features specific to the msgr2 protocol (for example, encryption) are
unavailable on some or all connections.</p>
<p>In most cases this can be corrected by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><style type="text/css">
span.prompt1:before {
  content: "$ ";
}
</style><span class="prompt1">ceph<span class="w"> </span>mon<span class="w"> </span>enable-msgr2</span>
</pre></div></div><p>After this command is run, any monitor configured to listen on the old default
port (6789) will continue to listen for v1 connections on 6789 and begin to
listen for v2 connections on the new default port 3300.</p>
<p>If a monitor is configured to listen for v1 connections on a non-standard port
(that is, a port other than 6789), the monmap will need to be modified
manually.</p>
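<p>In that case, the monitor’s registered addresses can be updated with the <code class="docutils literal notranslate"><span class="pre">ceph mon set-addrs</span></code> command. For example (substitute your monitor’s name, IP address, and ports):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph mon set-addrs &lt;mon-id&gt; [v2:&lt;ip&gt;:3300,v1:&lt;ip&gt;:&lt;custom-port&gt;]</span>
</pre></div></div>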
</section>
<section id="mon-disk-low">
<h4>MON_DISK_LOW<a class="headerlink" href="#mon-disk-low" title="Permalink to this heading"></a></h4>
<p>One or more monitors are low on storage space. This health check is raised if
the percentage of available space on the file system used by the monitor
database (normally <code class="docutils literal notranslate"><span class="pre">/var/lib/ceph/mon</span></code>) drops below the percentage value
<code class="docutils literal notranslate"><span class="pre">mon_data_avail_warn</span></code> (default: 30%).</p>
<p>This alert might indicate that some other process or user on the system is
filling up the file system used by the monitor. It might also indicate that the
monitor database is too large (see <code class="docutils literal notranslate"><span class="pre">MON_DISK_BIG</span></code> below).  Another common
scenario is that Ceph logging subsystem levels have been raised for
troubleshooting purposes without subsequent return to default levels.  Ongoing
verbose logging can easily fill up the file system containing <code class="docutils literal notranslate"><span class="pre">/var/log</span></code>. If
you trim logs that are currently open, remember to restart or instruct your
syslog or other daemon to re-open the log file.</p>
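<p>To see what is consuming space, standard file system tools can be run on the monitor host. The paths below assume the default locations:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">df -h /var/lib/ceph/mon                   # free space on the monitor's file system</span>
<span class="prompt1">du -sh /var/lib/ceph/mon /var/log/ceph    # size of the monitor database and logs</span>
</pre></div></div>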
<p>If space cannot be freed, the monitor’s data directory might need to be moved
to another storage device or file system (this relocation process must be
carried out while the monitor daemon is not running).</p>
</section>
<section id="mon-disk-crit">
<h4>MON_DISK_CRIT<a class="headerlink" href="#mon-disk-crit" title="Permalink to this heading"></a></h4>
<p>One or more monitors are critically low on storage space. This health check is
raised if the percentage of available space on the file system used by the
monitor database (normally <code class="docutils literal notranslate"><span class="pre">/var/lib/ceph/mon</span></code>) drops below the percentage
value <code class="docutils literal notranslate"><span class="pre">mon_data_avail_crit</span></code> (default: 5%). See <code class="docutils literal notranslate"><span class="pre">MON_DISK_LOW</span></code>, above.</p>
</section>
<section id="mon-disk-big">
<h4>MON_DISK_BIG<a class="headerlink" href="#mon-disk-big" title="Permalink to this heading"></a></h4>
<p>The database size for one or more monitors is very large. This health check is
raised if the size of the monitor database is larger than
<code class="docutils literal notranslate"><span class="pre">mon_data_size_warn</span></code> (default: 15 GiB).</p>
<p>A large database is unusual, but does not necessarily indicate a problem.
Monitor databases might grow in size when there are placement groups that have
not reached an <code class="docutils literal notranslate"><span class="pre">active+clean</span></code> state for a long time, or when extensive cluster
recovery, expansion, or topology changes have recently occurred.</p>
<p>This alert may also indicate that the monitor’s database is not properly
compacting, an issue that has been observed with some older versions of
RocksDB. Forcing compaction with <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">daemon</span> <span class="pre">mon.&lt;id&gt;</span> <span class="pre">compact</span></code> may suffice
to shrink the database’s storage usage.</p>
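<p>For example, to compact the database of <code class="docutils literal notranslate"><span class="pre">mon.a</span></code> through its admin socket (this must be run on the host where that monitor is running):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph daemon mon.a compact</span>
</pre></div></div>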
<p>This alert may also indicate that the monitor has a bug that prevents it from
pruning the cluster metadata that it stores. If the problem persists, please
report a bug.</p>
<p>To adjust the warning threshold, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>mon_data_size_warn<span class="w"> </span>&lt;size&gt;</span>
</pre></div></div></section>
<section id="auth-insecure-global-id-reclaim">
<h4>AUTH_INSECURE_GLOBAL_ID_RECLAIM<a class="headerlink" href="#auth-insecure-global-id-reclaim" title="Permalink to this heading"></a></h4>
<p>One or more clients or daemons that are connected to the cluster are not
securely reclaiming their <code class="docutils literal notranslate"><span class="pre">global_id</span></code> (a unique number that identifies each
entity in the cluster) when reconnecting to a monitor. The client is being
permitted to connect anyway because the
<code class="docutils literal notranslate"><span class="pre">auth_allow_insecure_global_id_reclaim</span></code> option is set to <code class="docutils literal notranslate"><span class="pre">true</span></code> (which may
be necessary until all Ceph clients have been upgraded) and because the
<code class="docutils literal notranslate"><span class="pre">auth_expose_insecure_global_id_reclaim</span></code> option is set to <code class="docutils literal notranslate"><span class="pre">true</span></code> (which
allows monitors to detect clients with “insecure reclaim” sooner by forcing
those clients to reconnect immediately after their initial authentication).</p>
<p>To identify which client(s) are using unpatched Ceph client code, run the
following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>detail</span>
</pre></div></div><p>If you collect a dump of the clients that are connected to an individual
monitor and examine the <code class="docutils literal notranslate"><span class="pre">global_id_status</span></code> field in the output of the dump,
you can see the <code class="docutils literal notranslate"><span class="pre">global_id</span></code> reclaim behavior of those clients. Here
<code class="docutils literal notranslate"><span class="pre">reclaim_insecure</span></code> means that a client is unpatched and is contributing to
this health check. To produce such a client dump, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>tell<span class="w"> </span>mon.<span class="se">\*</span><span class="w"> </span>sessions</span>
</pre></div></div><p>We strongly recommend that all clients in the system be upgraded to a newer
version of Ceph that correctly reclaims <code class="docutils literal notranslate"><span class="pre">global_id</span></code> values. After all clients
have been updated, run the following command to stop allowing insecure
reconnections:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mon<span class="w"> </span>auth_allow_insecure_global_id_reclaim<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div><p>If it is impractical to upgrade all clients immediately, you can temporarily
silence this alert by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>mute<span class="w"> </span>AUTH_INSECURE_GLOBAL_ID_RECLAIM<span class="w"> </span>1w<span class="w">   </span><span class="c1"># 1 week</span></span>
</pre></div></div><p>Although we do NOT recommend doing so, you can also disable this alert
indefinitely by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mon<span class="w"> </span>mon_warn_on_insecure_global_id_reclaim<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="auth-insecure-global-id-reclaim-allowed">
<h4>AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED<a class="headerlink" href="#auth-insecure-global-id-reclaim-allowed" title="Permalink to this heading"></a></h4>
<p>Ceph is currently configured to allow clients that reconnect to monitors using
an insecure process to reclaim their previous <code class="docutils literal notranslate"><span class="pre">global_id</span></code>. Such reclaiming is
allowed because, by default, <code class="docutils literal notranslate"><span class="pre">auth_allow_insecure_global_id_reclaim</span></code> is set
to <code class="docutils literal notranslate"><span class="pre">true</span></code>. It might be necessary to leave this setting enabled while existing
Ceph clients are upgraded to newer versions of Ceph that correctly and securely
reclaim their <code class="docutils literal notranslate"><span class="pre">global_id</span></code>.</p>
<p>If the <code class="docutils literal notranslate"><span class="pre">AUTH_INSECURE_GLOBAL_ID_RECLAIM</span></code> health check has not also been
raised and if the <code class="docutils literal notranslate"><span class="pre">auth_expose_insecure_global_id_reclaim</span></code> setting has not
been disabled (it is enabled by default), then there are currently no clients
connected that need to be upgraded. In that case, it is safe to disable
<code class="docutils literal notranslate"><span class="pre">insecure</span> <span class="pre">global_id</span> <span class="pre">reclaim</span></code> by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mon<span class="w"> </span>auth_allow_insecure_global_id_reclaim<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div><p>On the other hand, if there are still clients that need to be upgraded, then
this alert can be temporarily silenced by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>mute<span class="w"> </span>AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED<span class="w"> </span>1w<span class="w">   </span><span class="c1"># 1 week</span></span>
</pre></div></div><p>Although we do NOT recommend doing so, you can also disable this alert
indefinitely by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mon<span class="w"> </span>mon_warn_on_insecure_global_id_reclaim_allowed<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
</section>
<section id="id5">
<h3>Manager<a class="headerlink" href="#id5" title="Permalink to this heading"></a></h3>
<section id="mgr-down">
<h4>MGR_DOWN<a class="headerlink" href="#mgr-down" title="Permalink to this heading"></a></h4>
<p>All Ceph Manager daemons are currently down. The cluster should normally have
at least one running manager (<code class="docutils literal notranslate"><span class="pre">ceph-mgr</span></code>) daemon. If no manager daemon is
running, the cluster’s ability to monitor itself will be compromised, parts of
the management API will become unavailable (for example, the dashboard will not
work, and most CLI commands that report metrics or runtime state will block).
However, the cluster will still be able to perform client I/O operations and
recover from failures.</p>
<p>The down manager daemon(s) should be restarted as soon as possible to ensure
that the cluster can be monitored (for example, so that <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">-s</span></code> information
is available and up to date, and so that metrics can be scraped by Prometheus).</p>
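<p>As a sketch, assuming a package-based deployment in which the manager runs under a systemd unit named <code class="docutils literal notranslate"><span class="pre">ceph-mgr@&lt;hostname&gt;</span></code> (cephadm and container-based deployments use different unit names), run the following on the manager host:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">systemctl restart ceph-mgr@&lt;hostname&gt;</span>
</pre></div></div>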
</section>
<section id="mgr-module-dependency">
<h4>MGR_MODULE_DEPENDENCY<a class="headerlink" href="#mgr-module-dependency" title="Permalink to this heading"></a></h4>
<p>An enabled manager module is failing its dependency check. This health check
typically comes with an explanatory message from the module about the problem.</p>
<p>For example, a module might report that a required package is not installed: in
this case, you should install the required package and restart your manager
daemons.</p>
<p>This health check is applied only to enabled modules. If a module is not
enabled, you can see whether it is reporting dependency issues in the output of
<cite>ceph mgr module ls</cite>.</p>
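<p>For example, to list the manager modules along with any reported errors, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph mgr module ls</span>
</pre></div></div>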
</section>
<section id="mgr-module-error">
<h4>MGR_MODULE_ERROR<a class="headerlink" href="#mgr-module-error" title="Permalink to this heading"></a></h4>
<p>A manager module has experienced an unexpected error. Typically, this means
that an unhandled exception was raised from the module’s <cite>serve</cite> function. The
human-readable description of the error might be obscurely worded if the
exception did not provide a useful description of itself.</p>
<p>This health check might indicate a bug: please open a Ceph bug report if you
think you have encountered a bug.</p>
<p>However, if you believe the error is transient, you may restart your manager
daemon(s) or use <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">mgr</span> <span class="pre">fail</span></code> on the active daemon in order to force
failover to another daemon.</p>
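<p>For example, to force a failover away from the currently active manager daemon, run the following command (older releases may require naming the active daemon, as in <code class="docutils literal notranslate"><span class="pre">ceph mgr fail &lt;name&gt;</span></code>):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph mgr fail</span>
</pre></div></div>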
</section>
</section>
<section id="osds">
<h3>OSDs<a class="headerlink" href="#osds" title="Permalink to this heading"></a></h3>
<section id="osd-down">
<h4>OSD_DOWN<a class="headerlink" href="#osd-down" title="Permalink to this heading"></a></h4>
<p>One or more OSDs are marked <code class="docutils literal notranslate"><span class="pre">down</span></code>. The ceph-osd daemon might have been
stopped, or peer OSDs might be unable to reach the OSD over the network. Common
causes include a stopped or crashed daemon, a &quot;down&quot; host, or a network outage.</p>
<p>Verify that the host is healthy, the daemon is started, and the network is
functioning. If the daemon has crashed, the daemon log file
(<code class="docutils literal notranslate"><span class="pre">/var/log/ceph/ceph-osd.*</span></code>) might contain troubleshooting information.</p>
</section>
<section id="osd-crush-type-down">
<h4>OSD_&lt;crush type&gt;_DOWN<a class="headerlink" href="#osd-crush-type-down" title="Permalink to this heading"></a></h4>
<p>(for example, OSD_HOST_DOWN, OSD_ROOT_DOWN)</p>
<p>All of the OSDs within a particular CRUSH subtree are marked <code class="docutils literal notranslate"><span class="pre">down</span></code> (for example, all of the OSDs on a host).</p>
</section>
<section id="osd-orphan">
<h4>OSD_ORPHAN<a class="headerlink" href="#osd-orphan" title="Permalink to this heading"></a></h4>
<p>An OSD is referenced in the CRUSH map hierarchy, but it does not exist.</p>
<p>To remove the OSD from the CRUSH map hierarchy, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>crush<span class="w"> </span>rm<span class="w"> </span>osd.&lt;id&gt;</span>
</pre></div></div></section>
<section id="osd-out-of-order-full">
<h4>OSD_OUT_OF_ORDER_FULL<a class="headerlink" href="#osd-out-of-order-full" title="Permalink to this heading"></a></h4>
<p>The utilization thresholds for <cite>nearfull</cite>, <cite>backfillfull</cite>, <cite>full</cite>, and/or
<cite>failsafe_full</cite> are not ascending. In particular, the following pattern is
expected: <cite>nearfull &lt; backfillfull</cite>, <cite>backfillfull &lt; full</cite>, and <cite>full &lt;
failsafe_full</cite>.  This can result in unexpected cluster behavior.</p>
<p>To adjust these utilization thresholds, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-nearfull-ratio<span class="w"> </span>&lt;ratio&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-backfillfull-ratio<span class="w"> </span>&lt;ratio&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-full-ratio<span class="w"> </span>&lt;ratio&gt;</span>
</pre></div></div></section>
<section id="osd-full">
<h4>OSD_FULL<a class="headerlink" href="#osd-full" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have exceeded the <cite>full</cite> threshold and are preventing the
cluster from servicing writes.</p>
<p>To check utilization by pool, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>df</span>
</pre></div></div><p>To see the currently defined <cite>full</cite> ratio, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>dump<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>full_ratio</span>
</pre></div></div><p>A short-term workaround to restore write availability is to raise the full
threshold by a small amount. To do so, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-full-ratio<span class="w"> </span>&lt;ratio&gt;</span>
</pre></div></div><p>Additional OSDs should be deployed within appropriate CRUSH failure domains
in order to increase capacity, and / or existing data should be deleted
in order to free up space in the cluster.  One subtle situation is that the
<code class="docutils literal notranslate"><span class="pre">rados</span> <span class="pre">bench</span></code> tool may have been used to test one or more pools’ performance,
and the resulting RADOS objects were not subsequently cleaned up.  You may
check for this by invoking <code class="docutils literal notranslate"><span class="pre">rados</span> <span class="pre">ls</span></code> against each pool and looking for
objects with names beginning with <code class="docutils literal notranslate"><span class="pre">bench</span></code> or other job names.  These may
then be manually but very, very carefully deleted in order to reclaim capacity.</p>
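<p>For example, to look for leftover <code class="docutils literal notranslate"><span class="pre">rados bench</span></code> objects in a given pool, something like the following can be used. The <code class="docutils literal notranslate"><span class="pre">^bench</span></code> pattern is only a starting point; verify what a matched object actually is before deleting anything:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">rados -p &lt;poolname&gt; ls | grep '^bench'</span>
</pre></div></div>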
</section>
<section id="osd-backfillfull">
<h4>OSD_BACKFILLFULL<a class="headerlink" href="#osd-backfillfull" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have exceeded the <cite>backfillfull</cite> threshold or <em>would</em> exceed
it if the currently-mapped backfills were to finish, which will prevent data
from rebalancing to this OSD. This alert is an early warning that
rebalancing might be unable to complete and that the cluster is approaching
full.</p>
<p>To check utilization by pool, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>df</span>
</pre></div></div></section>
<section id="osd-nearfull">
<h4>OSD_NEARFULL<a class="headerlink" href="#osd-nearfull" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have exceeded the <cite>nearfull</cite> threshold. This alert is an early
warning that the cluster is approaching full.</p>
<p>To check utilization by pool, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>df</span>
</pre></div></div></section>
<section id="osdmap-flags">
<h4>OSDMAP_FLAGS<a class="headerlink" href="#osdmap-flags" title="Permalink to this heading"></a></h4>
<p>One or more cluster flags of interest have been set. These flags include:</p>
<ul class="simple">
<li><p><em>full</em> - the cluster is flagged as full and cannot serve writes</p></li>
<li><p><em>pauserd</em>, <em>pausewr</em> - there are paused reads or writes</p></li>
<li><p><em>noup</em> - OSDs are not allowed to start</p></li>
<li><p><em>nodown</em> - OSD failure reports are being ignored, and that means that the
monitors will not mark OSDs “down”</p></li>
<li><p><em>noin</em> - OSDs that were previously marked <code class="docutils literal notranslate"><span class="pre">out</span></code> are not being marked
back <code class="docutils literal notranslate"><span class="pre">in</span></code> when they start</p></li>
<li><p><em>noout</em> - “down” OSDs are not automatically being marked <code class="docutils literal notranslate"><span class="pre">out</span></code> after the
configured interval</p></li>
<li><p><em>nobackfill</em>, <em>norecover</em>, <em>norebalance</em> - recovery or data
rebalancing is suspended</p></li>
<li><p><em>noscrub</em>, <em>nodeep-scrub</em> - scrubbing is disabled</p></li>
<li><p><em>notieragent</em> - cache-tiering activity is suspended</p></li>
</ul>
<p>With the exception of <em>full</em>, these flags can be set or cleared by running the
following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;flag&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span><span class="nb">unset</span><span class="w"> </span>&lt;flag&gt;</span>
</pre></div></div></section>
<section id="osd-flags">
<h4>OSD_FLAGS<a class="headerlink" href="#osd-flags" title="Permalink to this heading"></a></h4>
<p>One or more OSDs, CRUSH nodes, or CRUSH device classes have a flag of interest set.
These flags include:</p>
<ul class="simple">
<li><p><em>noup</em>: these OSDs are not allowed to start</p></li>
<li><p><em>nodown</em>: failure reports for these OSDs will be ignored</p></li>
<li><p><em>noin</em>: if these OSDs were previously marked <code class="docutils literal notranslate"><span class="pre">out</span></code> automatically
after a failure, they will not be marked <code class="docutils literal notranslate"><span class="pre">in</span></code> when they start</p></li>
<li><p><em>noout</em>: if these OSDs are “down” they will not automatically be marked
<code class="docutils literal notranslate"><span class="pre">out</span></code> after the configured interval</p></li>
</ul>
<p>These flags can be set and cleared in batch by running the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-group<span class="w"> </span>&lt;flags&gt;<span class="w"> </span>&lt;who&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>unset-group<span class="w"> </span>&lt;flags&gt;<span class="w"> </span>&lt;who&gt;</span>
</pre></div></div><p>For example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-group<span class="w"> </span>noup,noout<span class="w"> </span>osd.0<span class="w"> </span>osd.1</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>unset-group<span class="w"> </span>noup,noout<span class="w"> </span>osd.0<span class="w"> </span>osd.1</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-group<span class="w"> </span>noup,noout<span class="w"> </span>host-foo</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>unset-group<span class="w"> </span>noup,noout<span class="w"> </span>host-foo</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>set-group<span class="w"> </span>noup,noout<span class="w"> </span>class-hdd</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>unset-group<span class="w"> </span>noup,noout<span class="w"> </span>class-hdd</span>
</pre></div></div></section>
<section id="old-crush-tunables">
<h4>OLD_CRUSH_TUNABLES<a class="headerlink" href="#old-crush-tunables" title="Permalink to this heading"></a></h4>
<p>The CRUSH map is using very old settings and should be updated. The oldest set
of tunables that can be used (that is, the oldest client version that can
connect to the cluster) without raising this health check is determined by the
<code class="docutils literal notranslate"><span class="pre">mon_crush_min_required_version</span></code> config option.
For more information, see <a class="reference internal" href="../crush-map/#crush-map-tunables"><span class="std std-ref">Tunables</span></a>.</p>
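<p>If all clients are recent enough, the tunables can be updated to the current optimal profile by running the following command. Note that this may trigger significant data movement and can lock out older clients, so check client compatibility first:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd crush tunables optimal</span>
</pre></div></div>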
</section>
<section id="old-crush-straw-calc-version">
<h4>OLD_CRUSH_STRAW_CALC_VERSION<a class="headerlink" href="#old-crush-straw-calc-version" title="Permalink to this heading"></a></h4>
<p>The CRUSH map is using an older, suboptimal method of calculating intermediate weight values for <code class="docutils literal notranslate"><span class="pre">straw</span></code> buckets.</p>
<p>The CRUSH map should be updated to use the newer method (that is:
<code class="docutils literal notranslate"><span class="pre">straw_calc_version=1</span></code>). For more information, see <a class="reference internal" href="../crush-map/#crush-map-tunables"><span class="std std-ref">可调选项</span></a>.</p>
</section>
<section id="cache-pool-no-hit-set">
<h4>CACHE_POOL_NO_HIT_SET<a class="headerlink" href="#cache-pool-no-hit-set" title="Permalink to this heading"></a></h4>
<p>One or more cache pools are not configured with a <em>hit set</em> to track
utilization. This issue prevents the tiering agent from identifying cold
objects that are to be flushed and evicted from the cache.</p>
<p>To configure hit sets on the cache pool, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>hit_set_type<span class="w"> </span>&lt;type&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>hit_set_period<span class="w"> </span>&lt;period-in-seconds&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>hit_set_count<span class="w"> </span>&lt;number-of-hitsets&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>hit_set_fpp<span class="w"> </span>&lt;target-false-positive-rate&gt;</span>
</pre></div></div></section>
<section id="osd-no-sortbitwise">
<h4>OSD_NO_SORTBITWISE<a class="headerlink" href="#osd-no-sortbitwise" title="Permalink to this heading"></a></h4>
<p>No OSDs older than Luminous v12.y.z are running, but the <code class="docutils literal notranslate"><span class="pre">sortbitwise</span></code> flag has not been set.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">sortbitwise</span></code> flag must be set in order for OSDs running Luminous v12.y.z
or newer to start. To safely set the flag, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span><span class="nb">set</span><span class="w"> </span>sortbitwise</span>
</pre></div></div></section>
<section id="osd-filestore">
<h4>OSD_FILESTORE<a class="headerlink" href="#osd-filestore" title="Permalink to this heading"></a></h4>
<p>One or more OSDs are running the old Filestore back end. The Filestore OSD back
end is deprecated; the BlueStore back end has been the default object store
since the Ceph Luminous release.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">mclock_scheduler</span></code> is not supported for Filestore OSDs. For this reason,
the default <code class="docutils literal notranslate"><span class="pre">osd_op_queue</span></code> is set to <code class="docutils literal notranslate"><span class="pre">wpq</span></code> for Filestore OSDs and is enforced
even if the user attempts to change it. To list the Filestore OSDs in the
cluster, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>report<span class="w"> </span><span class="p">|</span><span class="w"> </span>jq<span class="w"> </span>-c<span class="w"> </span><span class="s1">&#39;.&quot;osd_metadata&quot; | .[] | select(.osd_objectstore | contains(&quot;filestore&quot;)) | {id, osd_objectstore}&#39;</span></span>
</pre></div></div><p><strong>In order to upgrade to Reef or a later release, you must first migrate any
Filestore OSDs to BlueStore.</strong></p>
<p>If you are upgrading a pre-Reef release to Reef or later, but it is not
feasible to migrate Filestore OSDs to BlueStore immediately, you can
temporarily silence this alert by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>mute<span class="w"> </span>OSD_FILESTORE</span>
</pre></div></div><p>Since migration of Filestore OSDs to BlueStore can take a considerable amount
of time to complete, we recommend that you begin the process well in advance
of any update to Reef or to later releases.</p>
</section>
<section id="osd-unreachable">
<h4>OSD_UNREACHABLE<a class="headerlink" href="#osd-unreachable" title="Permalink to this heading"></a></h4>
<p>The registered v1/v2 public address(es) of one or more OSDs lie outside the
defined <cite>public_network</cite> subnet, which prevents these unreachable OSDs
from communicating properly with Ceph clients.</p>
<p>Even though these unreachable OSDs are in the <code class="docutils literal notranslate"><span class="pre">up</span></code> state, RADOS clients
will hang until the TCP timeout before erroring out, due to this inconsistency.</p>
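<p>To compare the OSDs’ registered addresses against the configured public network, the following commands may be a useful starting point:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd dump | grep '^osd'         # shows each OSD's registered addresses</span>
<span class="prompt1">ceph config get mon public_network</span>
</pre></div></div>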
</section>
<section id="pool-full">
<h4>POOL_FULL<a class="headerlink" href="#pool-full" title="Permalink to this heading"></a></h4>
<p>One or more pools have reached their quota and are no longer allowing writes.</p>
<p>To see pool quotas and utilization, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>df<span class="w"> </span>detail</span>
</pre></div></div><p>If you opt to raise the pool quota, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>max_objects<span class="w"> </span>&lt;num-objects&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;poolname&gt;<span class="w"> </span>max_bytes<span class="w"> </span>&lt;num-bytes&gt;</span>
</pre></div></div><p>Otherwise, delete some existing data to reduce utilization.</p>
</section>
<section id="bluefs-spillover">
<h4>BLUEFS_SPILLOVER<a class="headerlink" href="#bluefs-spillover" title="Permalink to this heading"></a></h4>
<p>One or more OSDs that use the BlueStore back end have been allocated <cite>db</cite>
partitions (that is, storage space for metadata, normally on a faster device),
but because that space has been filled, metadata has “spilled over” onto the
slow device. This is not necessarily an error condition or even unexpected
behavior, but may result in degraded performance. If the administrator had
expected that all metadata would fit on the faster device, this alert indicates
that not enough space was provided.</p>
<p>To disable this alert on all OSDs, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd<span class="w"> </span>bluestore_warn_on_bluefs_spillover<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div><p>Alternatively, to disable the alert on a specific OSD, run the following
command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bluestore_warn_on_bluefs_spillover<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div><p>To secure more metadata space, you can destroy and reprovision the OSD in
question. This process involves data migration and recovery.</p>
<p>It might also be possible to expand the LVM logical volume that backs the <cite>db</cite>
storage. If the underlying LV has been expanded, you must stop the OSD daemon
and inform BlueFS of the device-size change by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph-bluestore-tool<span class="w"> </span>bluefs-bdev-expand<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-<span class="nv">$ID</span></span>
</pre></div></div></section>
<section id="bluefs-available-space">
<h4>BLUEFS_AVAILABLE_SPACE<a class="headerlink" href="#bluefs-available-space" title="Permalink to this heading"></a></h4>
<p>To see how much space is free for BlueFS, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.123<span class="w"> </span>bluestore<span class="w"> </span>bluefs<span class="w"> </span>available</span>
</pre></div></div><p>This will output up to three values: <code class="docutils literal notranslate"><span class="pre">BDEV_DB</span> <span class="pre">free</span></code>, <code class="docutils literal notranslate"><span class="pre">BDEV_SLOW</span> <span class="pre">free</span></code>, and
<code class="docutils literal notranslate"><span class="pre">available_from_bluestore</span></code>. <code class="docutils literal notranslate"><span class="pre">BDEV_DB</span></code> and <code class="docutils literal notranslate"><span class="pre">BDEV_SLOW</span></code> report the amount
of space that has been acquired by BlueFS and is now considered free. The value
<code class="docutils literal notranslate"><span class="pre">available_from_bluestore</span></code> indicates the ability of BlueStore to relinquish
more space to BlueFS.  It is normal for this value to differ from the amount of
BlueStore free space, because the BlueFS allocation unit is typically larger
than the BlueStore allocation unit.  This means that only part of the BlueStore
free space will be available for BlueFS.</p>
</section>
<section id="bluefs-low-space">
<h4>BLUEFS_LOW_SPACE<a class="headerlink" href="#bluefs-low-space" title="Permalink to this heading"></a></h4>
<p>If BlueFS is running low on available free space and there is not much free
space available from BlueStore (in other words, <cite>available_from_bluestore</cite> has
a low value), consider reducing the BlueFS allocation unit size. To simulate
available space when the allocation unit is different, run the following
command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.123<span class="w"> </span>bluestore<span class="w"> </span>bluefs<span class="w"> </span>available<span class="w"> </span>&lt;alloc-unit-size&gt;</span>
</pre></div></div></section>
<section id="bluestore-fragmentation">
<h4>BLUESTORE_FRAGMENTATION<a class="headerlink" href="#bluestore-fragmentation" title="Permalink to this heading"></a></h4>
<p>As BlueStore operates, the free space on the underlying storage will become
fragmented.  This is normal and unavoidable, but excessive fragmentation causes
slowdown.  To inspect BlueStore fragmentation, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.123<span class="w"> </span>bluestore<span class="w"> </span>allocator<span class="w"> </span>score<span class="w"> </span>block</span>
</pre></div></div><p>The fragmentation score is given on a [0, 1] scale:</p>
<ul class="simple">
<li><p>[0.0 .. 0.4] tiny fragmentation</p></li>
<li><p>[0.4 .. 0.7] small, acceptable fragmentation</p></li>
<li><p>[0.7 .. 0.9] considerable, but safe fragmentation</p></li>
<li><p>[0.9 .. 1.0] severe fragmentation, which might impact BlueFS’s ability to get space from BlueStore</p></li>
</ul>
<p>To see a detailed report of free fragments, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.123<span class="w"> </span>bluestore<span class="w"> </span>allocator<span class="w"> </span>dump<span class="w"> </span>block</span>
</pre></div></div><p>For OSD processes that are not currently running, fragmentation can be
inspected with <cite>ceph-bluestore-tool</cite>. To see the fragmentation score, run the
following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph-bluestore-tool<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-123<span class="w"> </span>--allocator<span class="w"> </span>block<span class="w"> </span>free-score</span>
</pre></div></div><p>To dump detailed free chunks, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph-bluestore-tool<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-123<span class="w"> </span>--allocator<span class="w"> </span>block<span class="w"> </span>free-dump</span>
</pre></div></div></section>
<section id="bluestore-legacy-statfs">
<h4>BLUESTORE_LEGACY_STATFS<a class="headerlink" href="#bluestore-legacy-statfs" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have BlueStore volumes that were created prior to the
Nautilus release. (In Nautilus, BlueStore tracks its internal usage
statistics on a granular, per-pool basis.)</p>
<p>If <em>all</em> OSDs
are older than Nautilus, this means that the per-pool metrics are
simply unavailable. But if there is a mixture of pre-Nautilus and
post-Nautilus OSDs, the cluster usage statistics reported by <code class="docutils literal notranslate"><span class="pre">ceph</span>
<span class="pre">df</span></code> will be inaccurate.</p>
<p>The old OSDs can be updated to use the new usage-tracking scheme by stopping
each OSD, running a repair operation, and then restarting the OSD. For example,
to update <code class="docutils literal notranslate"><span class="pre">osd.123</span></code>, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">systemctl<span class="w"> </span>stop<span class="w"> </span>ceph-osd@123</span>
<span class="prompt1">ceph-bluestore-tool<span class="w"> </span>repair<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-123</span>
<span class="prompt1">systemctl<span class="w"> </span>start<span class="w"> </span>ceph-osd@123</span>
</pre></div></div><p>To disable this alert, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bluestore_warn_on_legacy_statfs<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="bluestore-no-per-pool-omap">
<h4>BLUESTORE_NO_PER_POOL_OMAP<a class="headerlink" href="#bluestore-no-per-pool-omap" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have volumes that were created prior to the Octopus release.
(In Octopus and later releases, BlueStore tracks omap space utilization by
pool.)</p>
<p>If there are any BlueStore OSDs that do not have the new tracking enabled, the
cluster will report an approximate value for per-pool omap usage based on the
most recent deep scrub.</p>
<p>The OSDs can be updated to track by pool by stopping each OSD, running a repair
operation, and then restarting the OSD. For example, to update <code class="docutils literal notranslate"><span class="pre">osd.123</span></code>, run
the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">systemctl<span class="w"> </span>stop<span class="w"> </span>ceph-osd@123</span>
<span class="prompt1">ceph-bluestore-tool<span class="w"> </span>repair<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-123</span>
<span class="prompt1">systemctl<span class="w"> </span>start<span class="w"> </span>ceph-osd@123</span>
</pre></div></div><p>To disable this alert, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bluestore_warn_on_no_per_pool_omap<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="bluestore-no-per-pg-omap">
<h4>BLUESTORE_NO_PER_PG_OMAP<a class="headerlink" href="#bluestore-no-per-pg-omap" title="Permalink to this heading"></a></h4>
<p>One or more OSDs have volumes that were created prior to the Pacific release.
(In Pacific and later releases, BlueStore tracks omap space utilization by
Placement Group (PG).)</p>
<p>Per-PG omap allows faster PG removal when PGs migrate.</p>
<p>The older OSDs can be updated to track by PG by stopping each OSD, running a
repair operation, and then restarting the OSD. For example, to update
<code class="docutils literal notranslate"><span class="pre">osd.123</span></code>, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">systemctl<span class="w"> </span>stop<span class="w"> </span>ceph-osd@123</span>
<span class="prompt1">ceph-bluestore-tool<span class="w"> </span>repair<span class="w"> </span>--path<span class="w"> </span>/var/lib/ceph/osd/ceph-123</span>
<span class="prompt1">systemctl<span class="w"> </span>start<span class="w"> </span>ceph-osd@123</span>
</pre></div></div><p>To disable this alert, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bluestore_warn_on_no_per_pg_omap<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="bluestore-disk-size-mismatch">
<h4>BLUESTORE_DISK_SIZE_MISMATCH<a class="headerlink" href="#bluestore-disk-size-mismatch" title="Permalink to this heading"></a></h4>
<p>One or more BlueStore OSDs have an internal inconsistency between the size of
the physical device and the metadata that tracks its size. This inconsistency
can lead to the OSD(s) crashing in the future.</p>
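<p>Before reprovisioning, it may help to confirm the mismatch by comparing the
size recorded in the BlueStore label with the size of the underlying block
device. A minimal sketch, assuming the OSD data path and device placeholder
shown here:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-123</span>
<span class="prompt1">blockdev --getsize64 /path/to/device</span>
</pre></div></div>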
<p>The OSDs that have this inconsistency should be destroyed and reprovisioned. Be
very careful to execute this procedure on only one OSD at a time, so as to
minimize the risk of losing any data. To execute this procedure, where <code class="docutils literal notranslate"><span class="pre">$N</span></code>
is the OSD that has the inconsistency, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>out<span class="w"> </span>osd.<span class="nv">$N</span></span>
<span class="prompt1"><span class="k">while</span><span class="w"> </span>!<span class="w"> </span>ceph<span class="w"> </span>osd<span class="w"> </span>safe-to-destroy<span class="w"> </span>osd.<span class="nv">$N</span><span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="k">do</span><span class="w"> </span>sleep<span class="w"> </span>1m<span class="w"> </span><span class="p">;</span><span class="w"> </span><span class="k">done</span></span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>destroy<span class="w"> </span>osd.<span class="nv">$N</span></span>
<span class="prompt1">ceph-volume<span class="w"> </span>lvm<span class="w"> </span>zap<span class="w"> </span>/path/to/device</span>
<span class="prompt1">ceph-volume<span class="w"> </span>lvm<span class="w"> </span>create<span class="w"> </span>--osd-id<span class="w"> </span><span class="nv">$N</span><span class="w"> </span>--data<span class="w"> </span>/path/to/device</span>
</pre></div></div><div class="admonition note">
<p class="admonition-title">Note</p>
<p>Wait for this recovery procedure to complete on one OSD before running it
on the next.</p>
</div>
</section>
<section id="bluestore-no-compression">
<h4>BLUESTORE_NO_COMPRESSION<a class="headerlink" href="#bluestore-no-compression" title="Permalink to this heading"></a></h4>
<p>One or more OSDs are unable to load a BlueStore compression plugin. This issue
might be caused by a broken installation, in which the <code class="docutils literal notranslate"><span class="pre">ceph-osd</span></code> binary does
not match the compression plugins, or by a recent upgrade after which the
<code class="docutils literal notranslate"><span class="pre">ceph-osd</span></code> daemon was not restarted.</p>
<p>To resolve this issue, verify that all of the packages on the host that is
running the affected OSD(s) are correctly installed and that the OSD daemon(s)
have been restarted. If the problem persists, check the OSD log for information
about the source of the problem.</p>
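<p>For example, on a systemd-managed host, a sketch of restarting an affected
OSD daemon and then searching its log for compression-plugin messages
(assuming OSD id 123) might look like this:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">systemctl restart ceph-osd@123</span>
<span class="prompt1">journalctl -u ceph-osd@123 | grep -i compress</span>
</pre></div></div>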
</section>
<section id="bluestore-spurious-read-errors">
<h4>BLUESTORE_SPURIOUS_READ_ERRORS<a class="headerlink" href="#bluestore-spurious-read-errors" title="Permalink to this heading"></a></h4>
<p>One or more BlueStore OSDs detect read errors on the main device.
BlueStore has recovered from these errors by retrying disk reads.  This alert
might indicate issues with underlying hardware, issues with the I/O subsystem,
or something similar.  Such issues can cause permanent data
corruption.  Some observations on the root cause of spurious read errors can be
found here: <a class="reference external" href="https://tracker.ceph.com/issues/22464">https://tracker.ceph.com/issues/22464</a></p>
<p>This alert does not require an immediate response, but the affected host might
need additional attention: for example, upgrading the host to the latest
OS/kernel versions and implementing hardware-resource-utilization monitoring.</p>
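<p>As a starting point for investigating the hardware, you might, for example,
check the kernel log on the affected host for low-level I/O errors (the exact
messages vary by kernel and driver):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">dmesg -T | grep -i 'i/o error'</span>
</pre></div></div>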
<p>To disable this alert on all OSDs, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd<span class="w"> </span>bluestore_warn_on_spurious_read_errors<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div><p>Or, to disable this alert on a specific OSD, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bluestore_warn_on_spurious_read_errors<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="block-device-stalled-read-alert">
<h4>BLOCK_DEVICE_STALLED_READ_ALERT<a class="headerlink" href="#block-device-stalled-read-alert" title="Permalink to this heading"></a></h4>
<p>There are certain BlueStore log messages that surface storage drive issues
that can cause performance degradation and potentially data unavailability or
loss. For example:</p>
<p><code class="docutils literal notranslate"><span class="pre">read</span> <span class="pre">stalled</span> <span class="pre">read</span> <span class="pre">0x29f40370000~100000</span> <span class="pre">(buffered)</span> <span class="pre">since</span> <span class="pre">63410177.290546s,</span> <span class="pre">timeout</span> <span class="pre">is</span> <span class="pre">5.000000s</span></code></p>
<p>However, this condition is difficult to spot because no discernible warning
surfaces it (for example, no health warning or entry in <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">detail</span></code>). More observations
can be found here: <a class="reference external" href="https://tracker.ceph.com/issues/62500">https://tracker.ceph.com/issues/62500</a></p>
<p>Because there can be false-positive <code class="docutils literal notranslate"><span class="pre">stalled</span> <span class="pre">read</span></code> instances, a mechanism
has been added to improve reliability. If the number of <code class="docutils literal notranslate"><span class="pre">stalled</span> <span class="pre">read</span></code>
indications for a given BlueStore block device within the last
<code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_lifetime</span></code> seconds is greater than or equal to
<code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_threshold</span></code>, this warning is reported in
<code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">detail</span></code>.</p>
<p>By default, <code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_lifetime</span></code> is <code class="docutils literal notranslate"><span class="pre">86400</span></code> seconds and
<code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_threshold</span></code> is <code class="docutils literal notranslate"><span class="pre">1</span></code>. These values can also be
configured for individual OSDs.</p>
<p>To change these values globally, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bdev_stalled_read_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bdev_stalled_read_warn_threshold<span class="w"> </span><span class="m">5</span></span>
</pre></div></div><p>Alternatively, this may be done surgically for individual OSDs or for all OSDs
that match a given mask (for example, a device class):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bdev_stalled_read_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bdev_stalled_read_warn_threshold<span class="w"> </span><span class="m">5</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>class:ssd<span class="w"> </span>bdev_stalled_read_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>class:ssd<span class="w"> </span>bdev_stalled_read_warn_threshold<span class="w"> </span><span class="m">5</span></span>
</pre></div></div></section>
<section id="wal-device-stalled-read-alert">
<h4>WAL_DEVICE_STALLED_READ_ALERT<a class="headerlink" href="#wal-device-stalled-read-alert" title="Permalink to this heading"></a></h4>
<p>A warning similar to <code class="docutils literal notranslate"><span class="pre">BLOCK_DEVICE_STALLED_READ_ALERT</span></code> is raised to
identify <code class="docutils literal notranslate"><span class="pre">stalled</span> <span class="pre">read</span></code> instances on a given BlueStore OSD’s <code class="docutils literal notranslate"><span class="pre">WAL_DEVICE</span></code>.
This warning can be configured via the <code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_lifetime</span></code> and
<code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_threshold</span></code> parameters, as described in the
<code class="docutils literal notranslate"><span class="pre">BLOCK_DEVICE_STALLED_READ_ALERT</span></code> section.</p>
</section>
<section id="db-device-stalled-read-alert">
<h4>DB_DEVICE_STALLED_READ_ALERT<a class="headerlink" href="#db-device-stalled-read-alert" title="Permalink to this heading"></a></h4>
<p>A warning similar to <code class="docutils literal notranslate"><span class="pre">BLOCK_DEVICE_STALLED_READ_ALERT</span></code> is raised to
identify <code class="docutils literal notranslate"><span class="pre">stalled</span> <span class="pre">read</span></code> instances on a given BlueStore OSD’s <code class="docutils literal notranslate"><span class="pre">DB_DEVICE</span></code>.
This warning can be configured via the <code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_lifetime</span></code> and
<code class="docutils literal notranslate"><span class="pre">bdev_stalled_read_warn_threshold</span></code> parameters, as described in the
<code class="docutils literal notranslate"><span class="pre">BLOCK_DEVICE_STALLED_READ_ALERT</span></code> section.</p>
</section>
<section id="bluestore-slow-op-alert">
<h4>BLUESTORE_SLOW_OP_ALERT<a class="headerlink" href="#bluestore-slow-op-alert" title="Permalink to this heading"></a></h4>
<p>There are certain BlueStore log messages that surface storage drive issues
that can lead to performance degradation and to data unavailability or loss. For example:</p>
<p><code class="docutils literal notranslate"><span class="pre">log_latency_fn</span> <span class="pre">slow</span> <span class="pre">operation</span> <span class="pre">observed</span> <span class="pre">for</span> <span class="pre">_txc_committed_kv,</span> <span class="pre">latency</span> <span class="pre">=</span> <span class="pre">12.028621219s,</span> <span class="pre">txc</span> <span class="pre">=</span> <span class="pre">0x55a107c30f00</span></code>
<code class="docutils literal notranslate"><span class="pre">log_latency_fn</span> <span class="pre">slow</span> <span class="pre">operation</span> <span class="pre">observed</span> <span class="pre">for</span> <span class="pre">upper_bound,</span> <span class="pre">latency</span> <span class="pre">=</span> <span class="pre">6.25955s</span></code>
<code class="docutils literal notranslate"><span class="pre">log_latency</span> <span class="pre">slow</span> <span class="pre">operation</span> <span class="pre">observed</span> <span class="pre">for</span> <span class="pre">submit_transaction..</span></code></p>
<p>Because there can be false-positive <code class="docutils literal notranslate"><span class="pre">slow</span> <span class="pre">ops</span></code> instances, a mechanism has
been added to improve reliability. If the number of <code class="docutils literal notranslate"><span class="pre">slow</span> <span class="pre">ops</span></code> indications for
a given BlueStore OSD within the last <code class="docutils literal notranslate"><span class="pre">bluestore_slow_ops_warn_lifetime</span></code>
seconds is greater than or equal to <code class="docutils literal notranslate"><span class="pre">bluestore_slow_ops_warn_threshold</span></code>, this
warning is reported in <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">detail</span></code>.</p>
<p>By default, <code class="docutils literal notranslate"><span class="pre">bluestore_slow_ops_warn_lifetime</span></code> is <code class="docutils literal notranslate"><span class="pre">86400</span></code> seconds and
<code class="docutils literal notranslate"><span class="pre">bluestore_slow_ops_warn_threshold</span></code> is <code class="docutils literal notranslate"><span class="pre">1</span></code>. These values can also be
configured for individual OSDs.</p>
<p>To change these values globally, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bluestore_slow_ops_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>bluestore_slow_ops_warn_threshold<span class="w"> </span><span class="m">5</span></span>
</pre></div></div><p>Alternatively, this may be done surgically for individual OSDs or for all OSDs
that match a given mask (for example, a device class):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bluestore_slow_ops_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd.123<span class="w"> </span>bluestore_slow_ops_warn_threshold<span class="w"> </span><span class="m">5</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>class:ssd<span class="w"> </span>bluestore_slow_ops_warn_lifetime<span class="w"> </span><span class="m">10</span></span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>class:ssd<span class="w"> </span>bluestore_slow_ops_warn_threshold<span class="w"> </span><span class="m">5</span></span>
</pre></div></div></section>
</section>
<section id="id6">
<h3>Device health<a class="headerlink" href="#id6" title="Permalink to this heading"></a></h3>
<section id="device-health">
<h4>DEVICE_HEALTH<a class="headerlink" href="#device-health" title="Permalink to this heading"></a></h4>
<p>One or more OSD devices are expected to fail soon, where the warning threshold
is determined by the <code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/warn_threshold</span></code> config option.</p>
<p>Because this alert applies only to OSDs that are currently marked <code class="docutils literal notranslate"><span class="pre">in</span></code>, the
appropriate response to this expected failure is (1) to mark the OSD <code class="docutils literal notranslate"><span class="pre">out</span></code> so
that data is migrated off of the OSD, and then (2) to remove the hardware from
the system. Note that this marking <code class="docutils literal notranslate"><span class="pre">out</span></code> is normally done automatically if
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/self_heal</span></code> is enabled (as determined by
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/mark_out_threshold</span></code>).  If an OSD device is compromised but
the OSD(s) on that device are still <code class="docutils literal notranslate"><span class="pre">up</span></code>, recovery can be degraded.  In such
cases it may be advantageous to forcibly stop the OSD daemon(s) in question so
that recovery can proceed from surviving healthy OSDs.  This should only be
done with extreme care so that data availability is not compromised.</p>
<p>To check device health, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>device<span class="w"> </span>info<span class="w"> </span>&lt;device-id&gt;</span>
</pre></div></div><p>Device life expectancy is set either by a prediction model that the Manager
runs or by an external tool that sets it by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>device<span class="w"> </span>set-life-expectancy<span class="w"> </span>&lt;device-id&gt;<span class="w"> </span>&lt;from&gt;<span class="w"> </span>&lt;to&gt;</span>
</pre></div></div><p>You can change the stored life expectancy manually, but such a change usually
doesn’t accomplish anything. The reason for this is that whichever tool
originally set the stored life expectancy will probably undo your change by
setting it again, and a change to the stored value does not affect the actual
health of the hardware device.</p>
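<p>To list all devices known to the cluster, together with their associated
daemons and any stored life expectancy, you can, for example, run:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph device ls</span>
</pre></div></div>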
</section>
<section id="device-health-in-use">
<h4>DEVICE_HEALTH_IN_USE<a class="headerlink" href="#device-health-in-use" title="Permalink to this heading"></a></h4>
<p>One or more devices (that is, OSDs) are expected to fail soon and have been
marked <code class="docutils literal notranslate"><span class="pre">out</span></code> of the cluster (as controlled by
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/mark_out_threshold</span></code>), but they are still participating in
one or more Placement Groups. This might be because the OSD(s) were marked
<code class="docutils literal notranslate"><span class="pre">out</span></code> only recently and data is still migrating, or because data cannot be
migrated off of the OSD(s) for some reason (for example, the cluster is nearly
full, or the CRUSH hierarchy is structured so that there isn’t another suitable
OSD to migrate the data to).</p>
<p>This message can be silenced by disabling self-heal behavior (that is, setting
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/self_heal</span></code> to <code class="docutils literal notranslate"><span class="pre">false</span></code>), by adjusting
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/mark_out_threshold</span></code>, or by addressing whichever condition
is preventing data from being migrated off of the ailing OSD(s).</p>
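<p>For example, a minimal sketch of disabling the self-heal behavior via the
Manager option mentioned above:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mgr mgr/devicehealth/self_heal false</span>
</pre></div></div>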
</section>
<section id="device-health-toomany">
<span id="rados-health-checks-device-health-toomany"></span><h4>DEVICE_HEALTH_TOOMANY<a class="headerlink" href="#device-health-toomany" title="Permalink to this heading"></a></h4>
<p>Too many devices (that is, OSDs) are expected to fail soon, and because
<code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/self_heal</span></code> behavior is enabled, marking <code class="docutils literal notranslate"><span class="pre">out</span></code> all of the
ailing OSDs would exceed the cluster’s <code class="docutils literal notranslate"><span class="pre">mon_osd_min_in_ratio</span></code> ratio.  This
ratio prevents a cascade of too many OSDs from being automatically marked
<code class="docutils literal notranslate"><span class="pre">out</span></code>.</p>
<p>You should promptly add new OSDs to the cluster to prevent data loss, or
incrementally replace the failing OSDs.</p>
<p>Alternatively, you can silence this health check by adjusting options including
<code class="docutils literal notranslate"><span class="pre">mon_osd_min_in_ratio</span></code> or <code class="docutils literal notranslate"><span class="pre">mgr/devicehealth/mark_out_threshold</span></code>.  Be
warned, however, that this will increase the likelihood of unrecoverable data
loss.</p>
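<p>For example, a sketch of lowering <code class="docutils literal notranslate"><span class="pre">mon_osd_min_in_ratio</span></code> (the value
<code class="docutils literal notranslate"><span class="pre">0.5</span></code> here is only illustrative):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mon mon_osd_min_in_ratio 0.5</span>
</pre></div></div>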
</section>
</section>
<section id="id7">
<h3>Data health (pools and placement groups)<a class="headerlink" href="#id7" title="Permalink to this heading"></a></h3>
<section id="pg-availability">
<h4>PG_AVAILABILITY<a class="headerlink" href="#pg-availability" title="Permalink to this heading"></a></h4>
<p>Data availability is reduced. In other words, the cluster is unable to service
potential read or write requests for at least some data in the cluster.  More
precisely, one or more Placement Groups (PGs) are in a state that does not
allow I/O requests to be serviced. Any of the following PG states are
problematic if they do not clear quickly: <em>peering</em>, <em>stale</em>, <em>incomplete</em>, and
the lack of <em>active</em>.</p>
<p>For detailed information about which PGs are affected, run the following
command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>detail</span>
</pre></div></div><p>In most cases, the root cause of this issue is that one or more OSDs are
currently <code class="docutils literal notranslate"><span class="pre">down</span></code>: see <code class="docutils literal notranslate"><span class="pre">OSD_DOWN</span></code> above.</p>
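<p>To list PGs that are stuck in an inactive state, you can also run, for
example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph pg dump_stuck inactive</span>
</pre></div></div>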
<p>To see the state of a specific problematic PG, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>tell<span class="w"> </span>&lt;pgid&gt;<span class="w"> </span>query</span>
</pre></div></div></section>
<section id="pg-degraded">
<h4>PG_DEGRADED<a class="headerlink" href="#pg-degraded" title="Permalink to this heading"></a></h4>
<p>Data redundancy is reduced for some data: in other words, the cluster does not
have the desired number of replicas for all data (in the case of replicated
pools) or erasure code fragments (in the case of erasure-coded pools).  More
precisely, one or more Placement Groups (PGs):</p>
<ul class="simple">
<li><p>have the <em>degraded</em> or <em>undersized</em> flag set, which means that there are not
enough instances of that PG in the cluster; or</p></li>
<li><p>have not had the <em>clean</em> state set for a long time.</p></li>
</ul>
<p>For detailed information about which PGs are affected, run the following
command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>health<span class="w"> </span>detail</span>
</pre></div></div><p>In most cases, the root cause of this issue is that one or more OSDs are
currently “down”: see <code class="docutils literal notranslate"><span class="pre">OSD_DOWN</span></code> above.</p>
<p>To see the state of a specific problematic PG, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>tell<span class="w"> </span>&lt;pgid&gt;<span class="w"> </span>query</span>
</pre></div></div></section>
<section id="pg-recovery-full">
<h4>PG_RECOVERY_FULL<a class="headerlink" href="#pg-recovery-full" title="Permalink to this heading"></a></h4>
<p>Data redundancy might be reduced or even put at risk for some data due to a
lack of free space in the cluster. More precisely, one or more Placement Groups
have the <em>recovery_toofull</em> flag set, which means that the cluster is unable to
migrate or recover data because one or more OSDs are above the <code class="docutils literal notranslate"><span class="pre">full</span></code>
threshold.</p>
<p>For steps to resolve this condition, see <em>OSD_FULL</em> above.</p>
</section>
<section id="pg-backfill-full">
<h4>PG_BACKFILL_FULL<a class="headerlink" href="#pg-backfill-full" title="Permalink to this heading"></a></h4>
<p>Data redundancy might be reduced or even put at risk for some data due to a
lack of free space in the cluster. More precisely, one or more Placement Groups
have the <em>backfill_toofull</em> flag set, which means that the cluster is unable to
migrate or recover data because one or more OSDs are above the <code class="docutils literal notranslate"><span class="pre">backfillfull</span></code>
threshold.</p>
<p>For steps to resolve this condition, see <em>OSD_BACKFILLFULL</em> above.</p>
</section>
<section id="pg-damaged">
<h4>PG_DAMAGED<a class="headerlink" href="#pg-damaged" title="Permalink to this heading"></a></h4>
<p>Data scrubbing has discovered problems with data consistency in the cluster.
More precisely, one or more Placement Groups either (1) have the <em>inconsistent</em>
or <code class="docutils literal notranslate"><span class="pre">snaptrim_error</span></code> flag set, which indicates that an earlier data scrub
operation found a problem, or (2) have the <em>repair</em> flag set, which means that
a repair for such an inconsistency is currently in progress.</p>
<p>For more information, see <a class="reference internal" href="../../troubleshooting/troubleshooting-pg/"><span class="doc">Troubleshooting PGs</span></a>.</p>
</section>
<section id="osd-scrub-errors">
<h4>OSD_SCRUB_ERRORS<a class="headerlink" href="#osd-scrub-errors" title="Permalink to this heading"></a></h4>
<p>Recent OSD scrubs have discovered inconsistencies. This alert generally
appears together with <em>PG_DAMAGED</em> (see above).</p>
<p>For more information, see <a class="reference internal" href="../../troubleshooting/troubleshooting-pg/"><span class="doc">Troubleshooting PGs</span></a>.</p>
</section>
<section id="osd-too-many-repairs">
<h4>OSD_TOO_MANY_REPAIRS<a class="headerlink" href="#osd-too-many-repairs" title="Permalink to this heading"></a></h4>
<p>The count of read repairs has exceeded the config value threshold
<code class="docutils literal notranslate"><span class="pre">mon_osd_warn_num_repaired</span></code> (default: <code class="docutils literal notranslate"><span class="pre">10</span></code>).  Because scrub handles errors
only for data at rest, and because any read error that occurs when another
replica is available will be repaired immediately so that the client can get
the object data, there might exist failing disks that are not registering any
scrub errors. This repair count is maintained as a way of identifying any such
failing disks.</p>
<p>To allow the warning to be cleared, the command
<code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">tell</span> <span class="pre">osd.#</span> <span class="pre">clear_shards_repaired</span> <span class="pre">[count]</span></code> has been added.
By default, it sets the repair count to <code class="docutils literal notranslate"><span class="pre">0</span></code>. An optional <cite>count</cite> value can be passed
to the command, so an administrator can re-enable the warning by passing a
value greater than or equal to <code class="docutils literal notranslate"><span class="pre">mon_osd_warn_num_repaired</span></code>.
An alternative to using <cite>clear_shards_repaired</cite> is to mute the
<cite>OSD_TOO_MANY_REPAIRS</cite> alert with <cite>ceph health mute</cite>.</p>
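<p>For example, to clear the repair count on <code class="docutils literal notranslate"><span class="pre">osd.123</span></code>, or alternatively to
mute the alert (here for one week):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph tell osd.123 clear_shards_repaired</span>
<span class="prompt1">ceph health mute OSD_TOO_MANY_REPAIRS 1w</span>
</pre></div></div>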
</section>
<section id="large-omap-objects">
<h4>LARGE_OMAP_OBJECTS<a class="headerlink" href="#large-omap-objects" title="Permalink to this heading"></a></h4>
<p>One or more pools contain large omap objects, as determined by
<code class="docutils literal notranslate"><span class="pre">osd_deep_scrub_large_omap_object_key_threshold</span></code> (threshold for the number of
keys to determine what is considered a large omap object) or
<code class="docutils literal notranslate"><span class="pre">osd_deep_scrub_large_omap_object_value_sum_threshold</span></code> (the threshold for the
summed size in bytes of all key values to determine what is considered a large
omap object) or both.  To find more information on object name, key count, and
size in bytes, search the cluster log for ‘Large omap object found’. This issue
can be caused by RGW-bucket index objects that do not have automatic resharding
enabled. For more information on resharding, see <a class="reference internal" href="../../../radosgw/dynamicresharding/#rgw-dynamic-bucket-index-resharding"><span class="std std-ref">RGW Dynamic Bucket Index
Resharding</span></a>.</p>
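<p>For example, one way to search the recent cluster log for the relevant
message, assuming it is still present in the in-memory cluster log (the line
count here is arbitrary):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph log last 10000 | grep 'Large omap object'</span>
</pre></div></div>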
<p>To adjust the thresholds mentioned above, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd<span class="w"> </span>osd_deep_scrub_large_omap_object_key_threshold<span class="w"> </span>&lt;keys&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd<span class="w"> </span>osd_deep_scrub_large_omap_object_value_sum_threshold<span class="w"> </span>&lt;bytes&gt;</span>
</pre></div></div></section>
<section id="cache-pool-near-full">
<h4>CACHE_POOL_NEAR_FULL<a class="headerlink" href="#cache-pool-near-full" title="Permalink to this heading"></a></h4>
<p>A cache-tier pool is nearly full, as determined by the <code class="docutils literal notranslate"><span class="pre">target_max_bytes</span></code> and
<code class="docutils literal notranslate"><span class="pre">target_max_objects</span></code> properties of the cache pool. Once the pool reaches the
target threshold, write requests to the pool might block while data is flushed
and evicted from the cache. This state normally leads to very high latencies
and poor performance.</p>
<p>To adjust the cache pool’s target size, run the following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;cache-pool-name&gt;<span class="w"> </span>target_max_bytes<span class="w"> </span>&lt;bytes&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;cache-pool-name&gt;<span class="w"> </span>target_max_objects<span class="w"> </span>&lt;objects&gt;</span>
</pre></div></div><p>There might be other reasons that normal cache flush and evict activity are
throttled: for example, reduced availability of the base tier, reduced
performance of the base tier, or overall cluster load.</p>
</section>
<section id="too-few-pgs">
<h4>TOO_FEW_PGS<a class="headerlink" href="#too-few-pgs" title="Permalink to this heading"></a></h4>
<p>The number of Placement Groups (PGs) that are in use in the cluster is below
the configurable threshold of <code class="docutils literal notranslate"><span class="pre">mon_pg_warn_min_per_osd</span></code> PGs per OSD. This can
lead to suboptimal distribution and suboptimal balance of data across the OSDs
in the cluster, and a reduction of overall performance.</p>
<p>If data pools have not yet been created, this condition is expected.</p>
<p>To address this issue, you can increase the PG count for existing pools or
create new pools.  For more information, see
<a class="reference internal" href="../placement-groups/#choosing-number-of-placement-groups"><span class="std std-ref">确定 PG 数量</span></a>.</p>
</section>
<section id="pool-pg-num-not-power-of-two">
<h4>POOL_PG_NUM_NOT_POWER_OF_TWO<a class="headerlink" href="#pool-pg-num-not-power-of-two" title="Permalink to this heading"></a></h4>
<p>One or more pools have a <code class="docutils literal notranslate"><span class="pre">pg_num</span></code> value that is not a power of two.  Although
this is not strictly incorrect, it does lead to a less balanced distribution of
data because some Placement Groups will have roughly twice as much data as
others have.</p>
<p>This is easily corrected by setting the <code class="docutils literal notranslate"><span class="pre">pg_num</span></code> value for the affected
pool(s) to a nearby power of two. To do so, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_num<span class="w"> </span>&lt;value&gt;</span>
</pre></div></div><p>To disable this health check, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>mon_warn_on_pool_pg_num_not_power_of_two<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="pool-too-few-pgs">
<h4>POOL_TOO_FEW_PGS<a class="headerlink" href="#pool-too-few-pgs" title="Permalink to this heading"></a></h4>
<p>One or more pools should probably have more Placement Groups (PGs), given the
amount of data that is currently stored in the pool. This issue can lead to
suboptimal distribution and suboptimal balance of data across the OSDs in the
cluster, and a reduction of overall performance. This alert is raised only if
the <code class="docutils literal notranslate"><span class="pre">pg_autoscale_mode</span></code> property on the pool is set to <code class="docutils literal notranslate"><span class="pre">warn</span></code>.</p>
<p>To disable the alert, entirely disable auto-scaling of PGs for the pool by
running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_autoscale_mode<span class="w"> </span>off</span>
</pre></div></div><p>To allow the cluster to automatically adjust the number of PGs for the pool,
run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_autoscale_mode<span class="w"> </span>on</span>
</pre></div></div><p>Alternatively, to manually set the number of PGs for the pool to the
recommended amount, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_num<span class="w"> </span>&lt;new-pg-num&gt;</span>
</pre></div></div><p>For more information, see <a class="reference internal" href="../placement-groups/#choosing-number-of-placement-groups"><span class="std std-ref">确定 PG 数量</span></a> and
<a class="reference internal" href="../placement-groups/#pg-autoscaler"><span class="std std-ref">自动伸缩归置组</span></a>.</p>
</section>
<section id="too-many-pgs">
<h4>TOO_MANY_PGS<a class="headerlink" href="#too-many-pgs" title="Permalink to this heading"></a></h4>
<p>The number of Placement Groups (PGs) in use in the cluster is above the
configurable threshold of <code class="docutils literal notranslate"><span class="pre">mon_max_pg_per_osd</span></code> PGs per OSD. If this threshold
is exceeded, the cluster will not allow new pools to be created, pool <cite>pg_num</cite>
to be increased, or pool replication to be increased (any of which, if allowed,
would lead to more PGs in the cluster). A large number of PGs can lead to
higher memory utilization for OSD daemons, slower peering after cluster state
changes (for example, OSD restarts, additions, or removals), and higher load on
the Manager and Monitor daemons.</p>
<p>The simplest way to mitigate the problem is to increase the number of OSDs in
the cluster by adding more hardware. Note that, because the OSD count that is
used for the purposes of this health check is the number of <code class="docutils literal notranslate"><span class="pre">in</span></code> OSDs,
marking <code class="docutils literal notranslate"><span class="pre">out</span></code> OSDs <code class="docutils literal notranslate"><span class="pre">in</span></code> (if there are any <code class="docutils literal notranslate"><span class="pre">out</span></code> OSDs available) can also
help. To do so, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span><span class="k">in</span><span class="w"> </span>&lt;osd<span class="w"> </span>id<span class="o">(</span>s<span class="o">)</span>&gt;</span>
</pre></div></div><p>For more information, see <a class="reference internal" href="../placement-groups/#choosing-number-of-placement-groups"><span class="std std-ref">Choosing the Number of Placement Groups</span></a>.</p>
</section>
<section id="pool-too-many-pgs">
<h4>POOL_TOO_MANY_PGS<a class="headerlink" href="#pool-too-many-pgs" title="Permalink to this heading"></a></h4>
<p>One or more pools should probably have fewer Placement Groups (PGs), given the
amount of data that is currently stored in the pool. This issue can lead to
higher memory utilization for OSD daemons, slower peering after cluster state
changes (for example, OSD restarts, additions, or removals), and higher load on
the Manager and Monitor daemons. This alert is raised only if the
<code class="docutils literal notranslate"><span class="pre">pg_autoscale_mode</span></code> property on the pool is set to <code class="docutils literal notranslate"><span class="pre">warn</span></code>.</p>
<p>To disable the alert, entirely disable auto-scaling of PGs for the pool by
running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_autoscale_mode<span class="w"> </span>off</span>
</pre></div></div><p>To allow the cluster to automatically adjust the number of PGs for the pool,
run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_autoscale_mode<span class="w"> </span>on</span>
</pre></div></div><p>Alternatively, to manually set the number of PGs for the pool to the
recommended amount, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>pg_num<span class="w"> </span>&lt;new-pg-num&gt;</span>
</pre></div></div><p>For more information, see <a class="reference internal" href="../placement-groups/#choosing-number-of-placement-groups"><span class="std std-ref">Choosing the Number of Placement Groups</span></a> and
<a class="reference internal" href="../placement-groups/#pg-autoscaler"><span class="std std-ref">Autoscaling Placement Groups</span></a>.</p>
</section>
<section id="pool-target-size-bytes-overcommitted">
<h4>POOL_TARGET_SIZE_BYTES_OVERCOMMITTED<a class="headerlink" href="#pool-target-size-bytes-overcommitted" title="Permalink to this heading"></a></h4>
<p>One or more pools have a <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> property that is set in order to
estimate the expected size of the pool, but the value(s) of this property are
greater than the total available storage (either by themselves or in
combination with other pools).</p>
<p>This alert is usually an indication that the <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> value for
the pool is too large and should be reduced or set to zero. To reduce the
<code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> value or set it to zero, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>target_size_bytes<span class="w"> </span><span class="m">0</span></span>
</pre></div></div><p>The above command sets the value of <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> to zero. To set the
value of <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> to a non-zero value, replace the <code class="docutils literal notranslate"><span class="pre">0</span></code> with that
non-zero value.</p>
<p>For more information, see <a class="reference internal" href="../placement-groups/#specifying-pool-target-size"><span class="std std-ref">Specifying Pool Target Size</span></a>.</p>
</section>
<section id="pool-has-target-size-bytes-and-ratio">
<h4>POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO<a class="headerlink" href="#pool-has-target-size-bytes-and-ratio" title="Permalink to this heading"></a></h4>
<p>One or more pools have both <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> and <code class="docutils literal notranslate"><span class="pre">target_size_ratio</span></code> set
in order to estimate the expected size of the pool.  Only one of these
properties should be non-zero. If both are set to a non-zero value, then
<code class="docutils literal notranslate"><span class="pre">target_size_ratio</span></code> takes precedence and <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> is ignored.</p>
<p>To reset <code class="docutils literal notranslate"><span class="pre">target_size_bytes</span></code> to zero, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool-name&gt;<span class="w"> </span>target_size_bytes<span class="w"> </span><span class="m">0</span></span>
</pre></div></div><p>For more information, see <a class="reference internal" href="../placement-groups/#specifying-pool-target-size"><span class="std std-ref">Specifying Pool Target Size</span></a>.</p>
</section>
<section id="too-few-osds">
<h4>TOO_FEW_OSDS<a class="headerlink" href="#too-few-osds" title="Permalink to this heading"></a></h4>
<p>The number of OSDs in the cluster is below the configurable threshold of
<code class="docutils literal notranslate"><span class="pre">osd_pool_default_size</span></code>. This means that some or all data may not be able to
satisfy the data protection policy specified in CRUSH rules and pool settings.</p>
</section>
<section id="smaller-pgp-num">
<h4>SMALLER_PGP_NUM<a class="headerlink" href="#smaller-pgp-num" title="Permalink to this heading"></a></h4>
<p>One or more pools have a <code class="docutils literal notranslate"><span class="pre">pgp_num</span></code> value less than <code class="docutils literal notranslate"><span class="pre">pg_num</span></code>. This alert is
normally an indication that the Placement Group (PG) count was increased
without any increase in the placement behavior.</p>
<p>This disparity is sometimes brought about deliberately, in order to separate
out the <cite>split</cite> step when the PG count is adjusted from the data migration that
is needed when <code class="docutils literal notranslate"><span class="pre">pgp_num</span></code> is changed.</p>
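<p>To compare the two values for a given pool, you can, for example, run:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd pool get &lt;pool&gt; pg_num</span>
<span class="prompt1">ceph osd pool get &lt;pool&gt; pgp_num</span>
</pre></div></div>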
<p>This issue is normally resolved by setting <code class="docutils literal notranslate"><span class="pre">pgp_num</span></code> to match <code class="docutils literal notranslate"><span class="pre">pg_num</span></code>, so
as to trigger the data migration, by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span><span class="nb">set</span><span class="w"> </span>&lt;pool&gt;<span class="w"> </span>pgp_num<span class="w"> </span>&lt;pg-num-value&gt;</span>
</pre></div></div></section>
<section id="many-objects-per-pg">
<h4>MANY_OBJECTS_PER_PG<a class="headerlink" href="#many-objects-per-pg" title="Permalink to this heading"></a></h4>
<p>One or more pools have an average number of objects per Placement Group (PG)
that is significantly higher than the overall cluster average. The specific
threshold is determined by the <code class="docutils literal notranslate"><span class="pre">mon_pg_warn_max_object_skew</span></code> configuration
value.</p>
<p>This alert is usually an indication that the pool(s) that contain most of the
data in the cluster have too few PGs, or that other pools that contain less
data have too many PGs. See <em>TOO_MANY_PGS</em> above.</p>
<p>This health check can be silenced by raising the threshold of the
<code class="docutils literal notranslate"><span class="pre">mon_pg_warn_max_object_skew</span></code> configuration option on the Manager.</p>
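<p>For example, a sketch of raising the threshold (the value
<code class="docutils literal notranslate"><span class="pre">20</span></code> is only illustrative):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mgr mon_pg_warn_max_object_skew 20</span>
</pre></div></div>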
<p>If <code class="docutils literal notranslate"><span class="pre">pg_autoscale_mode</span></code> is set to <code class="docutils literal notranslate"><span class="pre">on</span></code> for a given pool, this health
alert for that pool is silenced.</p>
</section>
<section id="pool-app-not-enabled">
<h4>POOL_APP_NOT_ENABLED<a class="headerlink" href="#pool-app-not-enabled" title="Permalink to this heading"></a></h4>
<p>A pool exists but the pool has not been tagged for use by a particular
application.</p>
<p>To resolve this issue, tag the pool for use by an application. For
example, if the pool is used by RBD, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">rbd<span class="w"> </span>pool<span class="w"> </span>init<span class="w"> </span>&lt;poolname&gt;</span>
</pre></div></div><p>Alternatively, if the pool is being used by a custom application (here ‘foo’),
you can label the pool by running the following low-level command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>application<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>foo</span>
</pre></div></div><p>For more information, see <a class="reference internal" href="../pools/#associate-pool-to-application"><span class="std std-ref">Associating a Pool with an Application</span></a>.</p>
</section>
<section id="id8">
<h4>POOL_FULL<a class="headerlink" href="#id8" title="Permalink to this heading"></a></h4>
<p>One or more pools have reached (or are very close to reaching) their quota. The
threshold to raise this health check is determined by the
<code class="docutils literal notranslate"><span class="pre">mon_pool_quota_crit_threshold</span></code> configuration option.</p>
<p>Pool quotas can be adjusted up or down (or removed) by running the following
commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;pool&gt;<span class="w"> </span>max_bytes<span class="w"> </span>&lt;bytes&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;pool&gt;<span class="w"> </span>max_objects<span class="w"> </span>&lt;objects&gt;</span>
</pre></div></div><p>To disable a quota, set the quota value to 0.</p>
</section>
<section id="pool-near-full">
<h4>POOL_NEAR_FULL<a class="headerlink" href="#pool-near-full" title="Permalink to this heading"></a></h4>
<p>One or more pools are approaching a configured fullness threshold.</p>
<p>One of the several thresholds that can raise this health check is determined by
the <code class="docutils literal notranslate"><span class="pre">mon_pool_quota_warn_threshold</span></code> configuration option.</p>
<p>Pool quotas can be adjusted up or down (or removed) by running the following
commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;pool&gt;<span class="w"> </span>max_bytes<span class="w"> </span>&lt;bytes&gt;</span>
<span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>pool<span class="w"> </span>set-quota<span class="w"> </span>&lt;pool&gt;<span class="w"> </span>max_objects<span class="w"> </span>&lt;objects&gt;</span>
</pre></div></div><p>To disable a quota, set the quota value to 0.</p>
<p>Other thresholds that can raise the two health checks above are
<code class="docutils literal notranslate"><span class="pre">mon_osd_nearfull_ratio</span></code> and <code class="docutils literal notranslate"><span class="pre">mon_osd_full_ratio</span></code>. For details and
resolution, see <a class="reference internal" href="../../configuration/mon-config-ref/#storage-capacity"><span class="std std-ref">Storage Capacity</span></a> and <a class="reference internal" href="../../troubleshooting/troubleshooting-osd/#no-free-drive-space"><span class="std std-ref">No Free Drive Space</span></a>.</p>
</section>
<section id="object-misplaced">
<h4>OBJECT_MISPLACED<a class="headerlink" href="#object-misplaced" title="Permalink to this heading"></a></h4>
<p>One or more objects in the cluster are not stored on the node that CRUSH would
prefer that they be stored on. This alert is an indication that data migration
due to a recent cluster change has not yet completed.</p>
<p>Misplaced data is not a dangerous condition in and of itself; data consistency
is never at risk, and old copies of objects will not be removed until the
desired number of new copies (in the desired locations) has been created.</p>
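<p>To watch the progress of the data migration, you can, for example, poll the
cluster status, which shows ongoing recovery and backfill activity:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph -s</span>
</pre></div></div>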
</section>
<section id="object-unfound">
<h4>OBJECT_UNFOUND<a class="headerlink" href="#object-unfound" title="Permalink to this heading"></a></h4>
<p>One or more objects in the cluster cannot be found. More precisely, the OSDs
know that a new or updated copy of an object should exist, but no such copy has
been found on OSDs that are currently online.</p>
<p>Read or write requests to unfound objects will block.</p>
<p>Ideally, a “down” OSD that has a more recent copy of the unfound object can be
brought back online. To identify candidate OSDs, check the peering state of the
PG(s) responsible for the unfound object. To see the peering state, run the
following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>tell<span class="w"> </span>&lt;pgid&gt;<span class="w"> </span>query</span>
</pre></div></div><p>On the other hand, if the latest copy of the object is not available, the
cluster can be told to roll back to a previous version of the object. For more
information, see <a class="reference internal" href="../../troubleshooting/troubleshooting-pg/#failures-osd-unfound"><span class="std std-ref">Unfound Objects</span></a>.</p>
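<p>To list the unfound objects in an affected PG, you can, for example, run:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph pg &lt;pgid&gt; list_unfound</span>
</pre></div></div>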
</section>
<section id="slow-ops">
<h4>SLOW_OPS<a class="headerlink" href="#slow-ops" title="Permalink to this heading"></a></h4>
<p>One or more OSD requests or monitor requests are taking a long time to process.
This alert might be an indication of extreme load, a slow storage device, or a
software bug.</p>
<p>To query the request queue for the daemon that is causing the slowdown, run the
following command from the daemon’s host:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.&lt;id&gt;<span class="w"> </span>ops</span>
</pre></div></div><p>To see a summary of the slowest recent requests, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>daemon<span class="w"> </span>osd.&lt;id&gt;<span class="w"> </span>dump_historic_ops</span>
</pre></div></div><p>To find the location of a specific OSD, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>osd<span class="w"> </span>find<span class="w"> </span>osd.&lt;id&gt;</span>
</pre></div></div></section>
<section id="pg-not-scrubbed">
<h4>PG_NOT_SCRUBBED<a class="headerlink" href="#pg-not-scrubbed" title="Permalink to this heading"></a></h4>
<p>One or more Placement Groups (PGs) have not been scrubbed recently. PGs are
normally scrubbed within an interval determined by
<a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_scrub_max_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_scrub_max_interval</span></code></a> globally. This interval can be overridden on
per-pool basis by changing the value of the variable
<code class="xref std std-confval docutils literal notranslate"><span class="pre">scrub_max_interval</span></code>. This health check is raised if a certain
percentage (determined by <code class="docutils literal notranslate"><span class="pre">mon_warn_pg_not_scrubbed_ratio</span></code>) of the interval
has elapsed after the time the scrub was scheduled and no scrub has been
performed.</p>
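<p>For example, a sketch of overriding the scrub interval for a single pool
(the value shown, one week in seconds, is only illustrative):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd pool set &lt;pool-name&gt; scrub_max_interval 604800</span>
</pre></div></div>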
<p>PGs will be scrubbed only if they are flagged as <code class="docutils literal notranslate"><span class="pre">clean</span></code> (which means that
they are to be cleaned, and not that they have been examined and found to be
clean). Misplaced or degraded PGs will not be flagged as <code class="docutils literal notranslate"><span class="pre">clean</span></code> (see
<em>PG_AVAILABILITY</em> and <em>PG_DEGRADED</em> above).</p>
<p>To manually initiate a scrub of a clean PG, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>pg<span class="w"> </span>scrub<span class="w"> </span>&lt;pgid&gt;</span>
</pre></div></div></section>
<section id="pg-not-deep-scrubbed">
<h4>PG_NOT_DEEP_SCRUBBED<a class="headerlink" href="#pg-not-deep-scrubbed" title="Permalink to this heading"></a></h4>
<p>One or more Placement Groups (PGs) have not been deep scrubbed recently. PGs
are normally scrubbed every <a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_deep_scrub_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code></a> seconds at most.
This health check is raised if a certain percentage (determined by
<code class="xref std std-confval docutils literal notranslate"><span class="pre">mon_warn_pg_not_deep_scrubbed_ratio</span></code>) of the interval has elapsed
after the time the scrub was scheduled and no scrub has been performed.</p>
<p>PGs will receive a deep scrub only if they are flagged as <code class="docutils literal notranslate"><span class="pre">clean</span></code> (the PG
state indicating that all of their objects are fully replicated, not that they
have already been examined by a scrub). Misplaced or degraded PGs might not be flagged as <code class="docutils literal notranslate"><span class="pre">clean</span></code>
(see <em>PG_AVAILABILITY</em> and <em>PG_DEGRADED</em> above).</p>
<p>This document offers two methods of setting the value of
<a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_deep_scrub_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code></a>. The first method listed here changes the
value of <a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_deep_scrub_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code></a> globally. The second method listed
here changes the value of <code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code> for OSDs and for
the Manager daemon.</p>
<section id="first-method">
<h5>First Method<a class="headerlink" href="#first-method" title="Permalink to this heading"></a></h5>
<p>To manually initiate a deep scrub of a clean PG, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>pg<span class="w"> </span>deep-scrub<span class="w"> </span>&lt;pgid&gt;</span>
</pre></div></div><p>Under certain conditions, the warning <code class="docutils literal notranslate"><span class="pre">PGs</span> <span class="pre">not</span> <span class="pre">deep-scrubbed</span> <span class="pre">in</span> <span class="pre">time</span></code>
appears. This might be because the cluster contains many large PGs, which take
longer to deep-scrub. To remedy this situation, you must change the value of
<a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_deep_scrub_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code></a> globally.</p>
<ol class="arabic">
<li><p>Confirm that <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">detail</span></code> returns a <code class="docutils literal notranslate"><span class="pre">pgs</span> <span class="pre">not</span> <span class="pre">deep-scrubbed</span> <span class="pre">in</span>
<span class="pre">time</span></code> warning:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># ceph health detail</span>
<span class="n">HEALTH_WARN</span> <span class="mi">1161</span> <span class="n">pgs</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="ow">in</span> <span class="n">time</span>
<span class="p">[</span><span class="n">WRN</span><span class="p">]</span> <span class="n">PG_NOT_DEEP_SCRUBBED</span><span class="p">:</span> <span class="mi">1161</span> <span class="n">pgs</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="ow">in</span> <span class="n">time</span>
<span class="n">pg</span> <span class="mf">86.</span><span class="n">fff</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="n">since</span> <span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">21</span><span class="n">T02</span><span class="p">:</span><span class="mi">35</span><span class="p">:</span><span class="mf">25.733187</span><span class="o">+</span><span class="mi">0000</span>
</pre></div>
</div>
</li>
<li><p>Change <code class="docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code> globally:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><style type="text/css">
span.prompt2:before {
  content: "# ";
}
</style><span class="prompt2">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>global<span class="w"> </span>osd_deep_scrub_interval<span class="w"> </span><span class="m">1209600</span></span>
</pre></div></div></li>
</ol>
<p>The above procedure was developed by Eugen Block in September of 2024.</p>
<p>See <a class="reference external" href="https://heiterbiswolkig.blogs.nde.ag/2024/09/06/pgs-not-deep-scrubbed-in-time/">Eugen Block’s blog post</a> for much more detail.</p>
<p>See <a class="reference external" href="https://tracker.ceph.com/issues/44959">Redmine tracker issue #44959</a>.</p>
</section>
<section id="second-method">
<h5>Second Method<a class="headerlink" href="#second-method" title="Permalink to this heading"></a></h5>
<p>To manually initiate a deep scrub of a clean PG, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>pg<span class="w"> </span>deep-scrub<span class="w"> </span>&lt;pgid&gt;</span>
</pre></div></div><p>Under certain conditions, the warning <code class="docutils literal notranslate"><span class="pre">PGs</span> <span class="pre">not</span> <span class="pre">deep-scrubbed</span> <span class="pre">in</span> <span class="pre">time</span></code>
appears. This might be because the cluster contains many large PGs, which take
longer to deep-scrub. To remedy this situation, you must change the value of
<a class="reference internal" href="../../configuration/osd-config-ref/#confval-osd_deep_scrub_interval"><code class="xref std std-confval docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code></a> for OSDs and for the Manager daemon.</p>
<ol class="arabic">
<li><p>Confirm that <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">health</span> <span class="pre">detail</span></code> returns a <code class="docutils literal notranslate"><span class="pre">pgs</span> <span class="pre">not</span> <span class="pre">deep-scrubbed</span> <span class="pre">in</span>
<span class="pre">time</span></code> warning:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># ceph health detail</span>
<span class="n">HEALTH_WARN</span> <span class="mi">1161</span> <span class="n">pgs</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="ow">in</span> <span class="n">time</span>
<span class="p">[</span><span class="n">WRN</span><span class="p">]</span> <span class="n">PG_NOT_DEEP_SCRUBBED</span><span class="p">:</span> <span class="mi">1161</span> <span class="n">pgs</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="ow">in</span> <span class="n">time</span>
<span class="n">pg</span> <span class="mf">86.</span><span class="n">fff</span> <span class="ow">not</span> <span class="n">deep</span><span class="o">-</span><span class="n">scrubbed</span> <span class="n">since</span> <span class="mi">2024</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">21</span><span class="n">T02</span><span class="p">:</span><span class="mi">35</span><span class="p">:</span><span class="mf">25.733187</span><span class="o">+</span><span class="mi">0000</span>
</pre></div>
</div>
</li>
<li><p>Change the <code class="docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code> for OSDs:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt2">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>osd<span class="w"> </span>osd_deep_scrub_interval<span class="w"> </span><span class="m">1209600</span></span>
</pre></div></div></li>
<li><p>Change the <code class="docutils literal notranslate"><span class="pre">osd_deep_scrub_interval</span></code> for Managers:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt2">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mgr<span class="w"> </span>osd_deep_scrub_interval<span class="w"> </span><span class="m">1209600</span></span>
</pre></div></div></li>
</ol>
<p>The above procedure was developed by Eugen Block in September of 2024.</p>
<p>See <a class="reference external" href="https://heiterbiswolkig.blogs.nde.ag/2024/09/06/pgs-not-deep-scrubbed-in-time/">Eugen Block’s blog post</a> for much more detail.</p>
<p>See <a class="reference external" href="https://tracker.ceph.com/issues/44959">Redmine tracker issue #44959</a>.</p>
</section>
</section>
<section id="pg-slow-snap-trimming">
<h4>PG_SLOW_SNAP_TRIMMING<a class="headerlink" href="#pg-slow-snap-trimming" title="Permalink to this heading"></a></h4>
<p>The snapshot trim queue for one or more PGs has exceeded the configured warning
threshold. This alert indicates either that an extremely large number of
snapshots was recently deleted, or that OSDs are unable to trim snapshots
quickly enough to keep up with the rate of new snapshot deletions.</p>
<p>The warning threshold is determined by the <code class="docutils literal notranslate"><span class="pre">mon_osd_snap_trim_queue_warn_on</span></code>
option (default: 32768).</p>
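<p>If a large queue is expected (for example, after a planned mass deletion of
snapshots), the warning threshold can be raised. For example, to double the
default, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mon mon_osd_snap_trim_queue_warn_on 65536</span>
</pre></div></div>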
<p>This alert might be raised if OSDs are under excessive load and unable to keep
up with their background work, or if the OSDs’ internal metadata database is
heavily fragmented and performing poorly. The alert might also indicate some
other performance issue with the OSDs.</p>
<p>The exact size of the snapshot trim queue is reported by the <code class="docutils literal notranslate"><span class="pre">snaptrimq_len</span></code>
field of <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">pg</span> <span class="pre">ls</span> <span class="pre">-f</span> <span class="pre">json-detail</span></code>.</p>
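<p>For example, to list only the PGs with a nonzero trim queue, the output can be
filtered. This is a sketch that assumes <code class="docutils literal notranslate"><span class="pre">jq</span></code> is installed and that the plain
<code class="docutils literal notranslate"><span class="pre">json</span></code> format exposes a top-level <code class="docutils literal notranslate"><span class="pre">pg_stats</span></code> array, as in recent releases:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph pg ls -f json | jq '.pg_stats[] | select(.snaptrimq_len &gt; 0) | {pgid, snaptrimq_len}'</span>
</pre></div></div>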
</section>
</section>
<section id="stretch-mode">
<h3>Stretch Mode<a class="headerlink" href="#stretch-mode" title="Permalink to this heading"></a></h3>
<section id="incorrect-num-buckets-stretch-mode">
<h4>INCORRECT_NUM_BUCKETS_STRETCH_MODE<a class="headerlink" href="#incorrect-num-buckets-stretch-mode" title="Permalink to this heading"></a></h4>
<p>Stretch mode currently supports only two dividing buckets that contain OSDs. This
warning indicates that the number of dividing buckets is not equal to 2 after stretch
mode has been enabled. Expect unpredictable failures and MON assertions until the
condition is fixed.</p>
<p>We encourage you to fix this by removing the extra dividing buckets or by bringing
the number of dividing buckets up to 2.</p>
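<p>For example, to review the bucket hierarchy and then remove a surplus dividing
bucket, run commands of the following form (a sketch: <code class="docutils literal notranslate"><span class="pre">site3</span></code> is a hypothetical
bucket name, and a bucket can be removed only after it has been emptied):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd tree</span>
<span class="prompt1">ceph osd crush remove site3</span>
</pre></div></div>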
</section>
<section id="uneven-weights-stretch-mode">
<h4>UNEVEN_WEIGHTS_STRETCH_MODE<a class="headerlink" href="#uneven-weights-stretch-mode" title="Permalink to this heading"></a></h4>
<p>The 2 dividing buckets must have equal weights when stretch mode is enabled.
This warning indicates that the 2 dividing buckets have uneven weights after
stretch mode has been enabled. This is not immediately fatal; however, you can expect
Ceph to be confused when trying to process transitions between dividing buckets.</p>
<p>We encourage you to fix this by making the weights even across the dividing buckets.
This can be done by making sure the combined weight of the OSDs in each dividing
bucket is the same.</p>
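<p>For example, to compare the bucket weights and then adjust the CRUSH weight of one
OSD, run commands of the following form (a sketch: <code class="docutils literal notranslate"><span class="pre">osd.7</span></code> and the weight
<code class="docutils literal notranslate"><span class="pre">1.0</span></code> are hypothetical):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph osd tree</span>
<span class="prompt1">ceph osd crush reweight osd.7 1.0</span>
</pre></div></div>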
</section>
<section id="nonexistent-mon-crush-loc-stretch-mode">
<h4>NONEXISTENT_MON_CRUSH_LOC_STRETCH_MODE<a class="headerlink" href="#nonexistent-mon-crush-loc-stretch-mode" title="Permalink to this heading"></a></h4>
<p>The CRUSH location specified for a monitor must belong to one of the dividing
buckets when stretch mode is enabled, with the <code class="docutils literal notranslate"><span class="pre">tiebreaker</span></code> monitor being the
only exception.</p>
<p>This warning suggests that one or more monitors have a CRUSH location that does
not belong to any of the dividing buckets in stretch mode.</p>
<p>We encourage you to fix this by making sure the CRUSH location of the monitor
belongs to one of the dividing buckets.</p>
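<p>For example, to assign a monitor to a dividing bucket, run a command of the
following form (a sketch: the bucket type <code class="docutils literal notranslate"><span class="pre">datacenter</span></code>, the bucket name
<code class="docutils literal notranslate"><span class="pre">site1</span></code>, and the monitor name <code class="docutils literal notranslate"><span class="pre">a</span></code> are assumptions):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph mon set_location a datacenter=site1</span>
</pre></div></div>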
</section>
</section>
<section id="nvmeof-gateway">
<h3>NVMeoF Gateway<a class="headerlink" href="#nvmeof-gateway" title="Permalink to this heading"></a></h3>
<section id="nvmeof-single-gateway">
<h4>NVMEOF_SINGLE_GATEWAY<a class="headerlink" href="#nvmeof-single-gateway" title="Permalink to this heading"></a></h4>
<p>One of the gateway groups has only one gateway. This is not ideal because it
makes high availability (HA) impossible with a single gateway in a group, and it
can lead to problems with failover and failback operations for the NVMeoF
gateway.</p>
<p>We recommend deploying multiple NVMeoF gateways in each group.</p>
</section>
<section id="nvmeof-gateway-down">
<h4>NVMEOF_GATEWAY_DOWN<a class="headerlink" href="#nvmeof-gateway-down" title="Permalink to this heading"></a></h4>
<p>Some of the gateways are in the GW_UNAVAILABLE state. If an NVMeoF daemon has
crashed, the daemon log file (found at <code class="docutils literal notranslate"><span class="pre">/var/log/ceph/</span></code>) may contain
troubleshooting information.</p>
</section>
<section id="nvmeof-gateway-deleting">
<h4>NVMEOF_GATEWAY_DELETING<a class="headerlink" href="#nvmeof-gateway-deleting" title="Permalink to this heading"></a></h4>
<p>Some of the gateways are in the GW_DELETING state. They will stay in this
state until all the namespaces under the gateway’s load balancing group are
moved to another load balancing group ID. This is done automatically by the
load balancing process. If this alert persists for a long time, there might
be an issue with that process.</p>
</section>
</section>
<section id="id11">
<h3>Miscellaneous<a class="headerlink" href="#id11" title="Permalink to this heading"></a></h3>
<section id="recent-crash">
<h4>RECENT_CRASH<a class="headerlink" href="#recent-crash" title="Permalink to this heading"></a></h4>
<p>One or more Ceph daemons have crashed recently, and the crash(es) have not yet
been acknowledged and archived by the administrator. This alert might indicate
a software bug, a hardware problem (for example, a failing disk), or some other
problem.</p>
<p>To list recent crashes, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>ls-new</span>
</pre></div></div><p>To examine information about a specific crash, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>info<span class="w"> </span>&lt;crash-id&gt;</span>
</pre></div></div><p>To silence this alert, you can archive the crash (perhaps after the crash
has been examined by an administrator) by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>archive<span class="w"> </span>&lt;crash-id&gt;</span>
</pre></div></div><p>Similarly, to archive all recent crashes, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>archive-all</span>
</pre></div></div><p>Archived crashes will still be visible by running the command <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">crash</span>
<span class="pre">ls</span></code>, but not by running the command <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">crash</span> <span class="pre">ls-new</span></code>.</p>
<p>The time period that is considered recent is determined by the option
<code class="docutils literal notranslate"><span class="pre">mgr/crash/warn_recent_interval</span></code> (default: two weeks).</p>
<p>To entirely disable this alert, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mgr/crash/warn_recent_interval<span class="w"> </span><span class="m">0</span></span>
</pre></div></div></section>
<section id="recent-mgr-module-crash">
<h4>RECENT_MGR_MODULE_CRASH<a class="headerlink" href="#recent-mgr-module-crash" title="Permalink to this heading"></a></h4>
<p>One or more <code class="docutils literal notranslate"><span class="pre">ceph-mgr</span></code> modules have crashed recently, and the crash(es) have
not yet been acknowledged and archived by the administrator.  This alert
usually indicates a software bug in one of the software modules that are
running inside the <code class="docutils literal notranslate"><span class="pre">ceph-mgr</span></code> daemon. The module that experienced the problem
might be disabled as a result, but other modules are unaffected and continue to
function as expected.</p>
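<p>If a module was disabled, it can be turned back on after the underlying problem
has been addressed. To list module states and re-enable a module, run the
following commands:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph mgr module ls</span>
<span class="prompt1">ceph mgr module enable &lt;module-name&gt;</span>
</pre></div></div>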
<p>As with the <em>RECENT_CRASH</em> health check, a specific crash can be inspected by
running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>info<span class="w"> </span>&lt;crash-id&gt;</span>
</pre></div></div><p>To silence this alert, you can archive the crash (perhaps after the crash has
been examined by an administrator) by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>archive<span class="w"> </span>&lt;crash-id&gt;</span>
</pre></div></div><p>Similarly, to archive all recent crashes, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>crash<span class="w"> </span>archive-all</span>
</pre></div></div><p>Archived crashes will still be visible by running the command <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">crash</span> <span class="pre">ls</span></code>
but not by running the command <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">crash</span> <span class="pre">ls-new</span></code>.</p>
<p>The time period that is considered recent is determined by the option
<code class="docutils literal notranslate"><span class="pre">mgr/crash/warn_recent_interval</span></code> (default: two weeks).</p>
<p>To entirely disable this alert, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span><span class="nb">set</span><span class="w"> </span>mgr/crash/warn_recent_interval<span class="w"> </span><span class="m">0</span></span>
</pre></div></div></section>
<section id="telemetry-changed">
<h4>TELEMETRY_CHANGED<a class="headerlink" href="#telemetry-changed" title="Permalink to this heading"></a></h4>
<p>Telemetry has been enabled, but because the contents of the telemetry report
have changed in the meantime, telemetry reports will not be sent.</p>
<p>Ceph developers occasionally revise the telemetry feature to include new and
useful information, or to remove information found to be useless or sensitive.
If any new information is included in the report, Ceph requires the
administrator to re-enable telemetry. This requirement ensures that the
administrator has an opportunity to (re)review the information that will be
shared.</p>
<p>To review the contents of the telemetry report, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>telemetry<span class="w"> </span>show</span>
</pre></div></div><p>Note that the telemetry report consists of several channels that may be
independently enabled or disabled. For more information, see <a class="reference internal" href="../../../mgr/telemetry/#telemetry"><span class="std std-ref">Telemetry Module</span></a>.</p>
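<p>Individual channels can also be toggled through the telemetry module’s options.
For example, to disable the <code class="docutils literal notranslate"><span class="pre">ident</span></code> channel (assuming the
<code class="docutils literal notranslate"><span class="pre">channel_ident</span></code> option described in the Telemetry Module documentation):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mgr mgr/telemetry/channel_ident false</span>
</pre></div></div>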
<p>To re-enable telemetry (and silence the alert), run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>telemetry<span class="w"> </span>on</span>
</pre></div></div><p>To disable telemetry (and silence the alert), run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>telemetry<span class="w"> </span>off</span>
</pre></div></div></section>
<section id="auth-bad-caps">
<h4>AUTH_BAD_CAPS<a class="headerlink" href="#auth-bad-caps" title="Permalink to this heading"></a></h4>
<p>One or more auth users have capabilities that cannot be parsed by the monitors.
As a general rule, this alert means that the user in question is not authorized
to perform any action with one or more daemon types.</p>
<p>This alert is most likely to be raised after an upgrade if (1) the capabilities
were set with an older version of Ceph that did not properly validate the
syntax of those capabilities, or if (2) the syntax of the capabilities has
changed.</p>
<p>To remove the user(s) in question, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>auth<span class="w"> </span>rm<span class="w"> </span>&lt;entity-name&gt;</span>
</pre></div></div><p>(This resolves the health check, but it prevents clients from being able to
authenticate as the removed user.)</p>
<p>Alternatively, to update the capabilities for the user(s), run the following
command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>auth<span class="w"> </span>&lt;entity-name&gt;<span class="w"> </span>&lt;daemon-type&gt;<span class="w"> </span>&lt;caps&gt;<span class="w"> </span><span class="o">[</span>&lt;daemon-type&gt;<span class="w"> </span>&lt;caps&gt;<span class="w"> </span>...<span class="o">]</span></span>
</pre></div></div><p>For more information about auth capabilities, see <a class="reference internal" href="../user-management/#user-management"><span class="std std-ref">User Management</span></a>.</p>
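<p>For example, to grant a user read access to the monitors and read/write access to
a single pool, run a command of the following form (a sketch: the entity
<code class="docutils literal notranslate"><span class="pre">client.example</span></code> and the pool <code class="docutils literal notranslate"><span class="pre">mypool</span></code> are placeholders):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph auth caps client.example mon 'allow r' osd 'allow rw pool=mypool'</span>
</pre></div></div>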
</section>
<section id="osd-no-down-out-interval">
<h4>OSD_NO_DOWN_OUT_INTERVAL<a class="headerlink" href="#osd-no-down-out-interval" title="Permalink to this heading"></a></h4>
<p>The <code class="docutils literal notranslate"><span class="pre">mon_osd_down_out_interval</span></code> option is set to zero, which means that the
system does not automatically perform any repair or healing operations when an
OSD fails. Instead, an administrator or an external orchestrator must manually
mark “down” OSDs as <code class="docutils literal notranslate"><span class="pre">out</span></code> (by running <code class="docutils literal notranslate"><span class="pre">ceph</span> <span class="pre">osd</span> <span class="pre">out</span> <span class="pre">&lt;osd-id&gt;</span></code>) in order to
trigger recovery.</p>
<p>This option is normally set to five or ten minutes: enough time for a host to be
power-cycled or rebooted.</p>
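<p>For example, to restore the default of ten minutes (600 seconds), run the
following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph config set mon mon_osd_down_out_interval 600</span>
</pre></div></div>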
<p>To silence this alert, set <code class="docutils literal notranslate"><span class="pre">mon_warn_on_osd_down_out_interval_zero</span></code> to
<code class="docutils literal notranslate"><span class="pre">false</span></code> by running the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>config<span class="w"> </span>global<span class="w"> </span>mon<span class="w"> </span>mon_warn_on_osd_down_out_interval_zero<span class="w"> </span><span class="nb">false</span></span>
</pre></div></div></section>
<section id="dashboard-debug">
<h4>DASHBOARD_DEBUG<a class="headerlink" href="#dashboard-debug" title="Permalink to this heading"></a></h4>
<p>Dashboard debug mode is enabled. This means that if an error occurs while
processing a REST API request, the HTTP error response will include a Python
traceback. This behavior should be disabled in production environments because
such a traceback might contain and expose sensitive information.</p>
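<p>To check whether debug mode is currently enabled, run the following command (the
<code class="docutils literal notranslate"><span class="pre">status</span></code> subcommand is available in recent releases):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph dashboard debug status</span>
</pre></div></div>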
<p>To disable debug mode, run the following command:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span class="prompt1">ceph<span class="w"> </span>dashboard<span class="w"> </span>debug<span class="w"> </span>disable</span>
</pre></div></div></section>
</section>
</section>
</section>



<div id="support-the-ceph-foundation" class="admonition note">
  <p class="first admonition-title">Brought to you by the Ceph Foundation</p>
  <p class="last">The Ceph Documentation is a community resource funded and hosted by the non-profit <a href="https://ceph.io/en/foundation/">Ceph Foundation</a>. If you would like to support this and our other efforts, please consider <a href="https://ceph.io/en/foundation/join/">joining now</a>.</p>
</div>


           </div>
           
          </div>
          <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
        <a href="../operating/" class="btn btn-neutral float-left" title="操纵集群" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
        <a href="../monitoring/" class="btn btn-neutral float-right" title="监控集群" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
    </div>

  <hr/>

  <div role="contentinfo">
    <p>&#169; Copyright 2016, Ceph authors and contributors. Licensed under Creative Commons Attribution Share Alike 3.0 (CC-BY-SA-3.0).</p>
  </div>

   

</footer>
        </div>
      </div>

    </section>

  </div>
  

  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.Navigation.enable(true);
      });
  </script>

  
  
    
   

</body>
</html>